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Linear mixed models are a powerful statistical tool for identifying genetic associations 
and avoiding confounding. However, existing methods are computationally intractable in 
large cohorts, and may not optimize power. All existing methods require time cost 0(MN 2 ) 
(where N = #samples and M = #SNPs) and implicitly assume an infinitesimal genetic ar- 
chitecture in which effect sizes are normally distributed, which can limit power. Here, we 
present a far more efficient mixed model association method, BOLT-LMM, which requires 
only a small number of O(MN) iterations and increases power by modeling more realistic, 
non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. 
We applied BOLT-LMM to nine quantitative traits in 23,294 samples from the Women's 
Genome Health Study (WGHS) and observed significant increases in power, consistent with 
simulations. Theory and simulations show that the boost in power increases with cohort size, 
making BOLT-LMM appealing for GWAS in large cohorts. 
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Linear mixed models are emerging as the method of choice for association testing in genome- 
wide association studies (GWAS) because they account for both population stratification and cryp- 
tic relatedness and achieve increased statistical power by jointly modeling all genotyped mark- 
ers [1-12]. However, existing mixed model methods still have limitations. First, mixed model 
analysis is computationally expensive. Despite a series of recent algorithmic advances, current al- 
gorithms require 0(MN 2 ) total running time (assuming N < M), where M is the number of markers 
and N is the sample size, a cost that is becoming prohibitive for large cohorts [12]. Second, cur- 
rent mixed model methods fall short of achieving maximal statistical power owing to suboptimal 
modeling assumptions regarding the genetic architectures underlying phenotypes. The standard 
linear mixed model implicitly assumes that all variants are causal with small effect sizes drawn 
from independent Gaussian distributions — the "infinitesimal model" — whereas in reality, complex 
traits have been estimated to have on the order of only a few thousand causal loci [13, 14]. 

Methodologically, efforts to more accurately model non-infinitesimal genetic architectures 
have followed two general thrusts. One approach is to apply the standard infinitesimal mixed 
model but adapt the input data. For example, large-effect loci can be explicitly identified and con- 
ditioned out as fixed effects [7], or the mixed model can be applied to only a selected subset of 
markers [9, 1 1, 15, 16]. A more flexible alternative approach is to adapt the mixed model itself by 
taking a Bayesian perspective and modeling SNP effects with non-Gaussian prior distributions that 
better accommodate both small- and large-effect loci. Such methods were pioneered in livestock 
genetics to improve prediction of genetic values [17] and have been extensively developed in the 
plant and animal breeding literature for the purpose of genomic selection [18]. While modeling 
methods that improve prediction should in theory enable corresponding improvements in statistical 
power of association analysis (via conditioning on other associated loci when testing a candidate 
marker [9, 12]), a challenge of applying Bayesian methods in the GWAS setting is that Bayesian 
statistics are not readily interpretable in the customary hypothesis testing framework. 

Here, we present an algorithm that performs mixed model analysis in a small number of 
0(MN)-time iterations and increases power by modeling non-infinitesimal genetic architectures. 
Our algorithm fits a Gaussian mixture model of SNP effects [19], using a fast variational approx- 
imation [20-22] to compute approximate phenotypic residuals, and tests the residuals for associa- 
tion with candidate markers via a retrospective score statistic [23] that provides a bridge between 
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Bayesian modeling for phenotype prediction and the frequentist association testing framework. 
We calibrate our statistic using an approach based on the recently developed LD Score regression 
technique [24] . The entire procedure operates directly on raw genotypes stored compactly in mem- 
ory and does not require computing or storing a genetic relationship matrix. In the special case of 
the infinitesimal model, we achieve results equivalent to existing methods at dramatically reduced 
time and memory cost. 

We provide an efficient software implementation of our algorithm, BOLT-LMM, and demon- 
strate its computational efficiency on simulated data sets of up to 480,000 individuals. Our simula- 
tions also show that BOLT-LMM achieves increased association power over standard infinitesimal 
mixed model analysis of traits driven by a few thousand causal SNPs. We applied BOLT-LMM 
to perform mixed model analysis of nine quantitative traits in 23,294 samples from the Women's 
Genome Health Study (WGHS) [25] and observed increased association power equivalent to an up 
to 10% increase in effective sample size. We demonstrate through theory and simulations that the 
boost in power increases with cohort size, making BOLT-LMM a promising approach for large- 
scale GWAS. 

Results 

Overview of Methods 

The BOLT-LMM algorithm consists of four main steps, each of which run in a small number of 
0(MN)-time iterations. These steps are: (la) Estimate variance parameters; (lb) Compute in- 
finitesimal mixed model association statistics (denoted BOLT-LMM-inf); (2a) Estimate Gaussian 
mixture parameters; (2b) Compute Gaussian mixture model association statistics (BOLT-LMM). 
Step la computes results nearly identical to standard variance components analysis but applies 
a new stochastic approximation algorithm that reduces time and memory cost by circumventing 
spectral decomposition, which is expensive for large sample sizes. Instead, the approximation 
algorithm only requires solving linear systems of mixed model equations, which can be accom- 
plished efficiently using conjugate gradient iteration [26,27]. Step lb likewise circumvents spec- 
tral decomposition by introducing a new retrospective mixed model association statistic similar to 
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GRAMMAR-Gamma [10] and MASTOR [23], which we compute — up to a calibration constant — 
using only solutions to linear systems of equations. We estimate the calibration constant by com- 
puting and comparing the new statistic and the standard prospective mixed model statistic at a 
random subset of SNPs, which can likewise be accomplished efficiently using conjugate gradient 
iteration. This procedure is similar in spirit to GRAMMAR-Gamma calibration but requires only 
0(MN)-time iterations. 

Steps 2a and 2b are Gaussian mixture parallels of steps la and lb. BOLT-LMM's non-infinitesimal 
model amounts to a relaxation of the standard mixed model, which from a Bayesian perspec- 
tive imposes a Gaussian prior distribution on SNP effect sizes. BOLT-LMM relaxes this model- 
ing assumption by allowing the prior distribution to be a mixture of two Gaussians, giving the 
model greater flexibility to accommodate large-effect SNPs while maintaining effective modeling 
of genome-wide effects (e.g., ancestry). For the Gaussian mixture model, it is no longer tractable 
to perform exact posterior inference, so BOLT-LMM instead computes a variational approxima- 
tion [20-22] that converges after a small number of 0(MN)-time iterations. Step 2a applies this 
method within 5-fold cross-validation to estimate best-fit parameters for the prior distribution (tak- 
ing into account variance parameters estimated in step la) based on out-of-sample prediction accu- 
racy. If the prediction accuracy of the best-fit Gaussian mixture model exceeds that of the infinites- 
imal model by at least a specified amount, step 2b is then run to compute association statistics by 
testing each SNP against the residual phenotype obtained from the Gaussian mixture model and 
calibrating the test statistics against the results of step lb using LD Score regression [24]. Other- 
wise, the BOLT-LMM association statistic is the same as BOLT-LMM-inf. Both step lb and step 
2b are performed using a leave-one-chromosome-out (LOCO) scheme to avoid proximal contami- 
nation [9, 12]. Further details of the method are provided in Online Methods and the Supplementary 
Note. The key properties of BOLT-LMM in terms of speed and model specification are compared 
to existing mixed model association methods in Table 1. 

BOLT-LMM is much more computationally efficient than existing methods 

To analyze the computational performance of BOLT-LMM, we simulated data sets of sizes rang- 
ing up to N = 3,750 to 480,000 individuals and M = 300,000 SNPs. We used genotypes from 
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the WTCCC2 data set [28] analyzed in ref. [12], which contains 15,633 individuals of European 
ancestry, to form mosaic chromosomes, and we used a phenotype model in which 5,000 SNPs 
explained 20% of phenotypic variance (Online Methods). 

We benchmarked BOLT-LMM against existing mixed model association methods, running 
each method for up to 10 days on machines with 96GB of memory. BOLT-LMM completed 
all analyses through N = 480,000 individuals within these constraits, whereas previous methods 
could only analyze a maximum of N = 7,500-30,000 individuals (Fig. 1). All previous methods 
require 0(MN 2 ) running time, whereas BOLT-LMM requires only ^0(MN L5 ) time (Fig. la and 
Supplementary Fig. la). We also observed substantial savings in memory use with BOLT-LMM, 
which requires little more than the MN/4 bytes of memory needed to store raw genotypes, much 
less than existing mixed model methods (Fig. lb and Supplementary Fig. lb). 

The running time of BOLT-LMM depends not only on the cost of matrix arithmetic, which 
scales linearly with M and N, but also the number of 0(MN)-time iterations required for con- 
vergence, which is roughly 0(N° 5 ) and also varies with heritability, relatedness, and population 
structure (see Supplementary Note, Supplementary Fig. 1 and Supplementary Fig. 2). These ob- 
servations apply both to the full Gaussian mixture modeling performed by BOLT-LMM and to 
the subset of the computation (steps la and lb) needed to compute BOLT-LMM-inf infinitesimal 
mixed model association statistics, which in our benchmarks required about 40% of the full BOLT- 
LMM run time (Fig. la and Supplementary Fig. la). Our results show that even on very large data 
sets, BOLT-LMM is efficient enough to enable mixed model analysis using a Gaussian mixture 
prior, which we recommend because of its potential to increase power. 

Simulations: BOLT-LMM increases power while controlling false positives 

To assess the power of BOLT-LMM to detect associated loci, we performed additional simula- 
tions using real genotypes from the WTCCC2 data set, which is an ancestry-stratified sample 
containing both Northern and Southern European samples. We simulated phenotypes with 1250- 
10,000 causal SNPs [13, 14] explaining 50% of phenotypic variance and an additional 60 candidate 
causal SNPs explaining 2% of the variance. We further introduced environmental differences in 
ancestry by including a component of phenotype aligned with the top principal component that 
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explained an additional 1% of the variance. (We note that principal component analysis is not part 
of BOLT-LMM; our recommendation, consistent with the recommendation of ref. [12], is that it is 
not necessary to perform PCA when running mixed model association methods.) We chose causal 
SNPs randomly from the first halves of chromosomes, leaving the second halves of chromosomes 
to contain only non-causal SNPs (Online Methods). 

We computed x 2 association statistics using linear regression with 10 principal components 
(PCA) [29], GCTA-LOCO [12], BOLT-LMM-inf, and BOLT-LMM. We were unable to run FaST- 
LMM-Select [15] because of its memory requirements (Fig. 1). For each method, we computed 
means of its \ 2 statistics over candidate causal SNPs and compared these means across simula- 
tion setups involving different numbers of causal SNPs (Fig. 2a). We observed that BOLT-LMM 
achieved power gains by modeling non-infinitesimal architectures. For the sparsest genetic ar- 
chitecture (1250 causal SNPs plus 60 causal candidate SNPs), we observed a 25% increase in 
mean BOLT-LMM \ 2 statistics at candidate SNPs compared to GCTA-LOCO and BOLT-LMM- 
inf infinitesimal mixed model x 2 statistics. This metric is readily interpretable as corresponding 
to a 25% increase in effective sample size; for completeness, we also computed traditional power 
curves at two significance thresholds (Supplementary Fig. 4). The power gain of the Gaussian 
mixture model decreased with increasing numbers of causal SNPs (Fig. 2a). This behavior is ex- 
pected because the advantage of the Gaussian mixture lies in its ability to more accurately model a 
small fraction of SNPs with larger effects amid a majority of SNPs with near-zero effects. Larger 
numbers of causal SNPs explaining a fixed proportion of variance result in smaller effect sizes per 
causal SNP, giving BOLT-LMM less opportunity for power gain. In contrast, all methods other 
than BOLT-LMM had performance independent of the number of causal SNPs, consistent with the 
fact that none of these methods model non-infinitesimal genetic architectures. GCTA-LOCO and 
BOLT-LMM-inf mean \ 2 statistics at candidate causal SNPs were essentially identical and slightly 
exceeded PCA, consistent with theory [12]. We also tested EMMAX [3] and GEMMA [6], which 
are vulnerable to proximal contamination [9, 12]; these methods suffered loss of power relative to 
PCA (Supplementary Fig. 3a), consistent with theory [12]. 

To further explore the relationship between the magnitude of Gaussian mixture model power 
gain and other parameters of the data set, we also varied the proportion of variance explained 
by causal SNPs (Fig. 2b), and the number of samples in our simulations (Fig. 2c). We observed 
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that the boost in power of BOLT-LMM over infinitesimal mixed model analysis (GCTA-LOCO, 
BOLT-LMM-inf) increased with each of these parameters. In further simulations using data sets 
of size N = 30,000 and N = 60,000 (Online Methods) and simulated phenotypes with M causal = 
250-15,000 causal SNPs explaining 15-35% of the variance, we observed that the effectiveness of 
the Gaussian mixture model is closely tied to hgN / M causal (where hg is heritability explained by 
genotyped SNPs); intuitively, this quantity measures the effective number of samples per causal 
SNP (Supplementary Fig. 5). These results are consistent with theory, which explains that in the 
absence of confounding, both infinitesimal and Gaussian mixture model analysis provide a power 
gain over marginal regression by conditioning on the estimated effects of other SNPs when testing 
a candidate SNP [9, 12]. As sample size increases, the power gain of both methods approaches an 
asymptote corresponding to an increase in effective sample size of 1/(1 -hg), but for sparse genetic 
architectures, the Gaussian mixture model can approach this asymptote much faster. 

To verify that BOLT-LMM is correctly calibrated and robust to confounding, we also computed 
mean % 2 statistics across SNPs on the second halves of chromosomes, simulated to all have zero 
effect ("null SNPs"). Because our simulated phenotypes included an ancestry effect, linear regres- 
sion without correcting for population stratification suffered 35% inflation. In contrast, the BOLT- 
LMM and BOLT-LMM-inf statistics were both well-calibrated (Supplementary Fig. 3b, Supple- 
mentary Table 3, and Supplementary Table 4). We further verified that Type I error was properly 
controlled (Online Methods and Supplementary Table 5) and that the distribution of statistics at 
null SNPs did not deviate noticeably from a chi-square with 1 degree of freedom (Supplementary 
Fig. 6a,b). Genomic inflation factors [30]) for BOLT-LMM and BOLT-LMM-inf exceeded 1 in 
these simulations (Supplementary Fig. 6c,d), consistent with polygenicity of the simulated pheno- 
type and use of a mixed model statistic that successfully avoids proximal contamination [12, 13]. 
In contrast, EMMAX and GEMMA had deflated test statistics (Supplementary Fig. 3b). 

Finally, we investigated the similarity between the BOLT-LMM-inf infinitesimal mixed model 
statistic versus existing methods at the level of individual SNPs. Despite its use of an infinitesimal 
model, the BOLT-LMM-inf statistic is not identical to any existing mixed model statistic because it 
is an approximate retrospective test statistic that avoids proximal contamination (Online Methods 
and Table 1). Nonetheless, we observed that BOLT-LMM-inf statistics very nearly match GCTA- 
LOCO statistics (which use the standard prospective model), with R 2 > 0.999 (Supplementary 
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Table 6 and Supplementary Fig. 7). 

BOLT-LMM increases power to detect associations for WGHS phenotypes 

To assess the efficacy of Gaussian mixture model analysis for increasing power on real phenotypes, 
we analyzed nine phenotypes in the Women's Genome Health Study (N = 23,294 samples, M = 
324,488 SNPs after QC) (Online Methods). These phenotypes consisted of five lipid phenotypes, 
height, body mass index, and two blood pressure phenotypes; we chose to analyze these pheno- 
types because of the availability of published lists of associations from large-scale GWAS of these 
traits. 

We compared the power of three association tests: linear regression with 10 principal compo- 
nents (PCA) [29], infinitesimal mixed model analysis with BOLT-LMM-inf, and Gaussian mixture 
modeling with BOLT-LMM. Because of memory constraints (Fig. 1), we were unable to run either 
GCTA-LOCO [12], FaST-LMM [5], or FaST-LMM- Select [15], which are the only previous meth- 
ods that avoid proximal contamination (Table 1); however, GCTA-LOCO and BOLT-LMM-inf 
statistics are essentially identical (Supplementary Table 6 and Supplementary Fig. 7). To compare 
power among these methods, we computed two roughly equivalent metrics: mean x 2 statistics at 
known associated loci, a direct but somewhat noisy approach due to having only 19-180 loci for 
each trait (Supplementary Table 7), and out-of-sample prediction R 2 (measured in cross-validation) 
using all SNPs for the mixed model methods and using only PCs for linear regression. For mixed 
model analysis, the latter approach estimates the ability of the mixed model to condition on effects 
of other SNPs when testing a candidate SNP, which drives its power (Online Methods) [12,31]. 

BOLT-LMM achieved higher power than PCA for all traits studied (Fig. 3 and Supplementary 
Table 8). Most of the increase was due to gains over infinitesimal mixed model analysis, with 
the magnitude of this power gain increasing with inferred concentration of genetic effects at few 
loci (Supplementary Table 9). The standard errors of the direct method of assessing improvement 
(mean \ 2 at known loci) were somewhat high (0.6-2.2%; Fig. 3a and Supplementary Table 8), 
so the improvement was statistically significant (p<0.05) for only 6 of the 9 traits. On the other 
hand, all of the improvements were statistically significant (p<0.0002) according to the prediction 
R 2 metric (Fig. 3b and Supplementary Table 8). The largest gains were achieved for lipid traits; 
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for ApoB, a lipoprotein closely related to LDL cholesterol, BOLT-LMM analysis achieved a 10% 
increase in mean \ 2 statistics versus PCA and a 9% increase versus infinitesimal mixed model 
analysis at known loci. To verify that these increases were not merely driven by a few loci with the 
largest effects, we also computed flat averages across loci of improvements in \ 2 statistics (restrict- 
ing to loci replicating in WGHS with at least nominal p < 0.05 significance to reduce statistical 
noise), obtaining consistent results (Supplementary Table 7). As noted above, simulations show 
that these improvements will increase with sample size (Fig. 2c and Supplementary Fig. 5). 

We also observed that infinitesimal mixed model analysis achieved statistically significant 
power gains over PCA, with the magnitude of the power gains increasing with the heritability 
parameter estimated by BOLT-LMM (Fig. 3 and Supplementary Table 8), which we refer to as 
pseudo-heritability (hp Seudo ; see Online Methods), following ref. [3]. For height (hp SCudo = 0.47 
in WGHS), the moderately large sample size of WGHS (N = 23,294) was enough to obtain a 6% 
increase in BOLT-LMM-inf x 2 statistics versus PCA, consistent with theory [12,31]. Once again, 
larger sample sizes will enable further gains [12,31]. 

To verify that BOLT-LMM successfully corrected for confounding from population structure 
in WGHS, we computed mean \ 2 statistics across all typed SNPs and genomic inflation factors for 
the three methods compared above as well as uncorrected marginal linear regression. We observed 
that PCA, BOLT-LMM-inf, and BOLT-LMM statistics were consistently calibrated, while uncor- 
rected linear regression statistics were inflated, especially for height (Supplementary Table 10). We 
further verified that genetic variation at the lactase gene had a false-positive genome-wide signifi- 
cant association with height using uncorrected marginal regression [32], which disappeared when 
using PCA, BOLT-LMM-inf, and BOLT-LMM (Supplementary Table 11). 

Discussion 

We have described a new algorithm for fast Bayesian mixed model association, BOLT-LMM, and 
demonstrated that it has time complexity ^0(MN L5 ) and requires only ^MN/4 bytes of memory, 
resulting in orders-of-magnitude improvements in computational efficiency over existing methods 
for large data sets. We have further shown in simulations and analyses of WGHS phenotypes that 
the Gaussian mixture modeling capability of BOLT-LMM enables increased association power 
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over standard mixed model analysis while controlling false positives. Among WGHS lipid traits, 
we observed power increases equivalent to increases in effective sample size of up to 10% over 
PC A and 9% over standard mixed model analysis. 

BOLT-LMM is an advance for two main reasons. First, as sample sizes continue to increase, 
mixed model analysis is simultaneously becoming more important — in order to correct for pop- 
ulation structure and cryptic relatedness in very large data sets — but lees practical with existing 
methods, all of which have >0(MN 2 ) time complexity and high memory requirements. The al- 
gorithmic innovations of BOLT-LMM overcome this computational barrier. Second, the ability of 
BOLT-LMM to better model non-infinitesimal genetic architectures enables a power gain relative 
to standard mixed model analysis. Recent methodological progress in this direction includes the 
multi-locus mixed model (MLMM) [7], which identifies and conditions out large-effect loci as 
fixed effects, and FaST-LMM-Select and related methods [9, 11, 15, 16,33], which adopt a sparse 
regression framework that restricts the mixed model to a subset of markers. However, these meth- 
ods all face the same 0(MN 2 ) computational hurdle as standard mixed model analysis. 

Bayesian methods have previously been developed that apply non-infinitesimal models to im- 
prove the accuracy of genetic risk prediction, but translating Bayes factors and posterior inclusion 
probabilities into the frequentist hypothesis testing framework favored by the GWAS community 
is a challenge [34]. The variational Bayes spike regression (vBsr) method [35] is a recent step to- 
ward addressing this issue, proposing a z-statistic heuristically calibrated by assuming that the vast 
majority of variants are unassociated (as in genomic control [30]), but such a technique is prone 
to deflation when large sample sizes cause inflation due to polygenicity [13,24]. BOLT-LMM 
sidesteps this difficulty via its hybrid approach of leaving each chromosome out in turn, fitting 
a Bayesian model on the remaining SNPs, and then applying a retrospective hypothesis test for 
association of left-out SNPs with the residual phenotype. In contrast to than modeling all SNPs 
simultaneously and assessing evidence for association using Bayesian posterior inference [34], our 
approach generalizes existing mixed model methods that are widely used, and we believe its ability 
to harness the power of Bayesian analysis while still computing frequentist statistics will be use- 
ful to GWAS practitioners. Additionally, such a hybrid approach lends itself readily to efficiently 
testing millions of imputed SNP dosages for association while including only typed SNPs in the 
mixed model, which we recommend in order to limit computational costs. 
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While BOLT-LMM improves upon existing mixed model association methods in both speed 
and power, BOLT-LMM still has limitations. First, the power gain that BOLT-LMM offers over 
existing methods via its more flexible prior on SNP effect sizes is contingent on the true genetic 
architecture being sufficiently non-infinitesimal and the sample size being sufficiently large (Sup- 
plementary Fig. 5). Second, BOLT-LMM, like existing mixed model methods, is susceptible to loss 
of power when used to analyze large ascertained case-control data sets in diseases of low preva- 
lence [12]. We recommend BOLT-LMM for randomly ascertained quantitative traits, ascertained 
case-control studies of diseases with prevalence >5% (e.g., type 2 diabetes, heart disease, com- 
mon cancers, hypertension, asthma) (see Supplementary Table 12), and studies of rarer diseases 
in large, non-ascertained population cohorts [36,37]. For large ascertained case-control studies 
of rarer diseases, we are developing a method of modeling ascertainment using posterior mean 
liabilities [38]; applying the techniques of BOLT-LMM to these posterior mean liabilities is an 
avenue for future research. Third, while mixed model analysis is effective in correcting for many 
forms of confounding, performing careful data quality control remains critical to avoiding false 
positives. Fourth, our work does note estimate the heritability explained by genotyped SNPs (h^; 
ref. [39]) — because hp SCudo may be different from hg (see Online Methods) — and does not con- 
duct or evaluate genetic prediction in external validation samples from an independent cohort [31]. 
Fifth, we have not studied the performance of mixed model methods in data sets dominated by 
family structure [23]; this will be investigated elsewhere. Sixth, while BOLT-LMM extends the 
infinitesimal model by generalizing the SNP effect prior to a mixture of two Gaussians, other pri- 
ors are possible and may be more appropriate for some genetic architectures (Table 1 of ref. [19]). 
Seventh, the running time of BOLT-LMM scales with the number of phenotypes analyzed; for data 
sets with a very large number of phenotypes (P), the GRAMMAR-Gamma method [10], which has 
running time 0(MN 2 +MNP) (reviewed in ref. [12]) may be faster. Finally, we have developed fast 
mixed model analysis for a mixed model with one random genetic effect; extending the algorithm 
to model multiple variance components [40] is a direction for future work. 

URLs. BOLT-LMM software, http : / /www . hsph . harvard . edu/alkes -price /software/. 
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Online Methods 

Standard mixed model association methods. Standard methods employ a model 

y = z t txtPtMt + 9 + e, (1) 

where y is the phenotype, x test is the candidate SNP being tested, g is the genetic effect, and 
e is the environmental effect. Assuming sample size N, all of the above are iV-vectors. We 
assume for now that all have been mean-centered and there are no covariates; we treat covariates by 
projecting them out from both genotypes and phenotypes (see Supplementary Note). The genetic 
and environmental effects are modeled as random effects, while the candidate SNP is modeled as 
a fixed effect with coefficient /5 test , and the goal is to test the null hypothesis /3 test = 0. Under the 
standard infinitesimal model, the genetic effect is modeled as 

9 = ^grm/^grm, (2) 

where X GRM is an N x M GRM matrix, each column of which contains normalized genotypes 
corresponding to a SNP included in the model, and /3 GRM is an M GRM -vector of random SNP effect 
sizes all drawn from the same normal distribution, so that g has a multivariate normal distribution 
with covariance Cov(g) oc X GRM X GRM ' . Note that in order to avoid proximal contamination [9, 
12], the (M GRM ) SNPs used in X GRM should vary depending on which SNP x test is being tested: the 
candidate SNP x test (and SNPs in linkage disequilibrium with it) should be excluded from X GRM 
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to avoid modeling its effect twice. BOLT-LMM adopts the leave-one-chromosome-out (LOCO) 
scheme of GCTA-LOCO [12], in which X GRM contains all genotyped SNPs except for those on 
the same chromosome as x test . 

Conventionally, the matrix Xgrm^grm'/^grm is referred to as the genetic relationship matrix 
(GRM) or empirical kinship matrix K, and we write 

Cav(g) = a 2 X GRM X GRM '/M GRM = a 2 g K, (3) 

where a 2 is a variance parameter. Environmental effects are assumed to be i.i.d. normal, so e is 
also multivariate normal with 

Cov(e) = a 2 J, (4) 

where / denotes the N x N identity matrix and a 2 is another variance parameter. 

In practice, the variance parameters a 2 and a 2 are unknown. Several existing methods [3, 10, 12] 
therefore take a two-step approach to computing association statistics: first estimate the variance 
parameters (with the SNP x test removed from the model) using restricted maximum likelihood 
(PvEML), and then compute the prospective chi-squared (1 d.o.f.) test statistic (as previously pro- 
posed in family-based tests [41]) 

2 _ (x t JV- l y) 2 

X LMM — JT7ZI ) V>) 

Xtest v ^test 

where 

V = Cov(y) = a 2 g K + a 2 e I, (6) 

assuming the variance parameters a 2 and a 2 have been fixed to their estimates under the null 
hypothesis /3 test = 0. Recent computational advances have also enabled computation of exact like- 
lihood ratio test statistics that model the variance parameters while testing the candidate SNP [6,9]. 
While exact statistics are more accurate in situations with very large-effect SNPs, the approximate 
methods produce near-identical results in typical human genetics scenarios [3, 10, 12]. In the case 
of GCTA-LOCO [12], we note that the test statistic becomes 

2 (XtJVLOCO^V) 2 , n , 

X LMM — 777 — [ , {') 

^test V LOCO ^test 
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where we have written I^loco for V to indicate that the chromosome containing x test is left out of 
the GRM. 

BOLT-LMM-inf mixed model statistic. The BOLT-LMM-inf infinitesimal mixed model statis- 
tic is slightly different: 

2 Otest' ^LOttT 1 ?/) 2 /c , 

X BOLT-LMM-inf — , (o) 

Qnf 

where q n f is a constant calibration factor estimated as 

Mean (aW^LOco" 1 ?/) 2 , Q . 

c inf — , , 2 ' 

Mean x^lmm-loco 

so that 

Mean x 2 b OLT-LMM-inf = Mean x 2 lmm-loco , (10) 

where 

2 _ (^test'^LOCO^ 1 !/) 2 nl * 

X lmm-loco — -pry zi v-L-U 

^test V LOCO ^test 

denotes the LOCO version of x 2 lmm- In practice, for computational efficiency, we take means 
over 30 random SNPs not significantly associated with the phenotype (x 2 < 5 estimated with the 
GRAMMAR statistic [42]). (For replicability, the BOLT-LMM software uses a pseudo-random 
number generator to select the "random" SNPs in a deterministic manner so that final results are 
always exactly the same for different runs.) We have observed empirically in simulations and 
analyses of real phenotypes that 30 random SNPs are enough to estimate the calibration factor to 
within 1% (Supplementary Table 13). 

We can view the BOLT-LMM-inf statistic either as an approximation of the standard prospec- 
tive statistic or as a retrospective quasi-likelihood score statistic. The first perspective is motivated 
by the observation that the denominator of the prospective statistic in equation (5), x le ^V^ l x test , 
is nearly independent of the SNP x tes t being tested [10]. From this perspective, BOLT-LMM-inf is 
similar to GRAMMAR-Gamma [10], with two key differences: (1) BOLT-LMM-inf is computed 
via a much faster algorithm (described below) to perform the initial variance components analysis 
and estimate the calibration constant, and (2) BOLT-LMM-inf avoids proximal contamination via 
LOCO analysis. 
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Retrospective formulation of BOLT-LMM-inf association test. Alternatively, we can also 
view BOLT-LMM-inf as a retrospective quasi-likelihood score test similar to t score ~ r [43] and 
MASTOR [23]. Following ref. [23], this viewpoint models the phenotype y as known and the SNP 
a: tes t as random, being drawn from a distribution with covariance Cov(x test ) = $ and mean 

E[x test | y\ = (§V- l y)a } (12) 



where we wish to test the null hypothesis a = 0. Note that x tes t is a discrete random variable, 
so we cannot formally model it as a multivariate normal and perform likelihood-based analysis, 
but the retrospective mean model is enough to obtain a quasilikelihood score test statistic that is 
asymptotically ^-distributed: 

(x^\<$>V- l y)) 2 (x^V-'y? 

Previous formulations of this statistic [23,43] have been aimed at family studies in which the 
pedigree matrix may be used as the covariance matrix $. For the general case of individuals with 
unknown pedigree, a seemingly natural choice is to substitute the GRM, but unfortunately, the 
GRM does not accurately reflect the covariance matrix of the distribution of unassociated (i.e., 
null) SNPs. In particular, for large sample sizes, this approach incurs deflation because it incor- 
porates SNPs associated with y into the covariance model (for x test , treated as random), causing 
overestimation of the variance of x test V~ 1 y. This phenomenon is not overcome by LOCO analysis 
because unlike with prospective modeling, the problem is not proximal contamination; rather, it 
is polygenicity: to obtain unbiased variance estimates based on genotype data, we would need to 
eliminate associated SNPs, which runs into the same difficulty faced by genomic control [13,24]. 

While the denominator of x 2 lmm-r in equation (13), (V~ 1 y)'&(V~ 1 y), cannot be computed 
without $, it does not involve the candidate SNP x test and can therefore be treated as a constant 
calibration factor, so that in practice, equation (13) is equivalent to the BOLT-LMM-inf formula- 
tion in equation (8). (More precisely, within a LOCO scheme, the matrix V depends on which 
chromosome x test belongs to, so there should formally be a different denominator for each chro- 
mosome, but all are approximately equal.) The retrospective model thus provides an alternative 
justification for the BOLT-LMM-inf statistic. 
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BOLT-LMM Gaussian mixture model association statistic. Importantly, the retrospective view- 
point also enables a natural generalization of BOLT-LMM-inf to a more powerful association test 
for non-infinitesimal phenotypes. The basic observation motivating this extension is that in the 
retrospective mixed model statistic x 2 lmm-r, the vector V~ x y is a scalar multiple of the residual 
after best linear unbiased prediction (BLUP) — i.e., the component of the phenotype remaining af- 
ter conditioning out SNP effects estimated by the mixed model, which drives the power of mixed 
model association — but this vector may be replaced by any other vector not involving x test . We 
therefore define a generalized statistic 

2 ( X t est Z/resid-LOCOj 
X BOLT-LMM — , (.14) 

C 

where y r esid-Loco denotes the residual phenotype upon conditioning out the effects of SNPs not on 
the same chromosome as x test , estimated by fitting a Gaussian mixture extension of the standard 
LMM, and c denotes a constant calibration factor, estimated so that the LD Score regression in- 
tercept [24] of bolt-lmm matches that of the calibrated x 2 BOLT-LMM-inf statistic. When using the 
infinitesimal model, y res i d -Loco is proportional to V^y, and x 2 bolt-lmm reduces to x 2 b oLT-LMM-inf • 
The Gaussian mixture extension of the standard LMM is defined as follows. First, note that the 
null model of the standard LMM used by BOLT-LMM-inf admits the Bayesian formulation 

y = X LOC o/3loco + e, (15) 

where SNP effects (3 m (where m indexes SNPs not on the left-out chromosome) are independently 
drawn from the normal distribution 

/3 m ~iV(0,<7 2 )/M LOC o (16) 

and environmental effects e n (where n indexes samples) are independently drawn from the normal 
distribution e n ~ N(0, cr 2 ). Performing best linear unbiased prediction amounts to computing the 
posterior mean of the genetic effect X LO co/^loco- 

To generalize this model to non-infinitesimal genetic architectures, we replace the normal prior 
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on SNP effect sizes with a spike-and-slab mixture of normals [19]: 




N(0, a 2 g x ) with probability p 



N(0, tjp 2 ) with probability I — p. 



(17) 



The rationale behind using a mixture of normals is that it more flexibly models the heavier-tailed 
distributions of genetic effects of typical (non-infinitesimal) phenotypes. Explicitly, if p <C 1 and 
°i i ^ a \ 2 small, the first component of the mixture is a "slab" that models the existence of a 
small number of relatively large-effect loci, while the second component is a "spike" that models 
the assumption that most SNPs have near- zero — but not exactly zero — effect on the phenotype. It 
is important that the second component not be a pure spike with zero variance, because assign- 
ing a small amount of variance to the spike component allows the model to capture components 
of phenotype explained by aggregate genome- wide effects such as ancestry or relatedness. Conse- 
quently, when testing SNPs for association with the residual phenotype, these genome- wide effects 
are appropriately conditioned out, making BOLT-LMM robust to confounding. 

BOLT-LMM estimates best-fit mixture parameters to use in the Gaussian mixture model via 
cross-validation; details are given in the algorithm description below. We can now finally complete 
the definition of the x 2 bolt-lmm statistic in equation (14): using the fitted mixture parameters, we 
fit the Bayesian model (once per left-out chromosome) and obtain residuals 



where /9loco are the estimated posterior mean effect sizes. Plugging these residuals into equation 
(14) gives the BOLT-LMM Gaussian mixture model association test statistic. 

Fast iterative algorithm. As described in Results (Overview of Methods), the BOLT-LMM soft- 
ware performs a four-step computation for mixed model association analysis, stopping after the 
first two steps when specialized to the infinitesimal model. We outline the main algorithms here 
and provide full details in the Supplementary Note. 



2/resid-LOCO — V ~ -^LOCO/^LOCO 



(18) 
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Step la: Estimate variance parameters. A key advance of BOLT-LMM is its ability to perform 
REML variance components analysis (for the special case of one identity and one non-identity vari- 
ance component) — i.e., estimate a 2 and a 2 in equations (3) and (4) — using linear-time iterations 
without building or decomposing any covariance matrices. Our algorithm relies on rewriting or 
approximating all 0(MN 2 )-time matrix computations in terms of solutions to linear systems of 
equations (i.e., expressions of the form V^x), which we in turn compute using only O(MN)- 
time matrix-vector products via conjugate gradient iteration. In particular, computing V~ l x using 
conjugate gradient iteration only requires computing products of V with vectors, which we can 
compute in 0(MN)-time without forming V by leaving V in factored form a 2 XX' /M + a 2 I, 
distributing the product, and multiplying from right to left. 

To estimate variance parameters using this approach, we apply the observation that the REML 
first-order conditions on a 2 and a 2 are equivalent to the system of equations 

£[E = E ^E Sod] = E ^a, (19) 

where both the left and right sides of each equation are functions of the assumed variance pa- 
rameters a 2 and a 2 (equation (14.8) of ref. [44]). The right sides involve BLUP predictions on 
the observed phenotype data y (performing BLUP using the assumed a 2 and a 2 , which can be 
done using conjugate gradient iteration), while the left sides are expectations of the same quan- 
tities over random y generated according to the model y = g + e with variance parameters set 
to the assumed a 2 and a 2 . We can therefore estimate the left sides (for a particular choice of 
a 2 and a 2 ) by Monte Carlo sampling, i.e., by generating random y mnd = X/3 rand + e rand from 
A-and,™ ~ -^"(0, a 2 /M), e randin ~ iV(0, a 2 ) and then performing BLUP on y mnd using the assumed 
a 2 and a 2 . Only a few Monte Carlo trials are required; BOLT-LMM computes 3-15 trials, with 
the number of trials decreasing as iV increases, reflecting the way in which the standard error de- 
pends on sample size and the number of trials. Explicitly, BOLT-LMM by default uses 4 x 10 9 /iV 2 
trials, with a minimum of 3 and a maximum of 15. The number of trials can also be varied as a 
user-specified parameter. 

While the above procedure as stated requires a 2-parameter search over a 2 and a 2 , we can 
eliminate a 2 , leaving a 1 -parameter zero-finding problem over the parameter 5 = a 2 /a 2 [45]. 
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BOLT-LMM performs the zero-finding numerically using secant iteration, a finite difference ap- 
proximation of Newton's method. Finally, we note that satisfying the first-order conditions is not 
in general a sufficient condition to have found the REML global optimum, but for the case of only 
one non-identity variance component, the likelihood surface appears to be well-behaved, and we 
have never observed an instance in which this estimation procedure failed. 

Step lb: Compute and calibrate BOLT-LMM-inf statistics. Using the variance parameters 
estimated in step la, it is straightforward to compute (for each LOCO rep) the quantity V W co~ l y 
in the numerator of the BOLT-LMM-inf statistic, equation (8), by using conjugate gradient iteration 
as described above. Completing the computation of the numerator for each SNP x test then just 
amounts to a calculating a dot product, which requires only O(MN) additional cost across all SNPs. 
Similarly, to compute the calibration constant q n f in equation (9), BOLT-LMM rapidly computes 
the prospective statistic x 2 lmm-loco from equation (1 1) at 30 random SNPs by applying conjugate 
gradient iteration to compute V LO co~ la; test for each of the 30 selected SNPs x test . 

Note that there is a slight mismatch between the variance parameters computed in step la, 
which BOLT-LMM computes once using all SNPs — not leaving any chromosomes out — and the 
theoretically optimal variance parameters for the infinitesimal mixed model fit in each LOCO rep, 
which vary slightly depending on which chromosome is left out. However, we have observed 
in simulations that slight mis- specification of the variance parameters has a negligible impact 
(<0.5%) on the calibration of the BOLT-LMM-inf and BOLT-LMM statistics (Supplementary 
Table 4). Because very slight miscalibration is not a concern for confounding from population 
stratification at highly differentiated markers (Supplementary Table 11) and has little impact on 
Type I error (Supplementary Table 5), the BOLT-LMM software does not by default refit variance 
parameters for each LOCO rep. If extremely precise calibration is desired, we provide runtime op- 
tions to refit variance parameters for each LOCO rep or to partition the genome more finely (e.g., 
into 100 segments rather than 22), each at the cost of a factor of 2-3 in running time. 

Step 2a: Estimate Gaussian mixture prior parameters. The first step of BOLT-LMM Gaus- 
sian mixture model association analyis is to estimate parameters for the mixture of two Gaussians 
used as the Bayesian prior on SNP effect sizes. As written in equation (17), this mixture has three 
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parameters: <t\ 1 and er| 2 , the variances of the two Gaussians, and p, the probability of drawing 
from the first Gaussian. To reduce the complexity of parameter estimation, we constrain the total 
variance of the mixture to equal the variance cr^/M estimated for SNP effects by the infinitesimal 
model in step la: 

pal 1 + (l-p)al 2 = a 2 g /M. (20) 

We reparameterize the remaining two degrees of freedom using the parameters p and f 2 , where 
f 2 denotes the proportion of the total mixture variance within the second Gaussian (the "spike" 
component that models small genome- wide effects): 

f _ (I-PK2 (2U 

J 2 ~ — 2 — r~7i \ 2 ■ ^ li > 

p*h + (1 - p) a p,2 

Because the model fit is not sensitive to the precise values of the mixture parameters, we test a 
discrete set of model parameter combinations: 

h e {0.5,0.3,0.1} 
p E {0.5,0.2.0.1,0.05,0.02,0.01}. 

Note that f 2 = 0.5, p = 0.5 corresponds to the infinitesimal model: when f 2 = 1 — p, the two 
Gaussians are identical and the mixture is degenerate. We bounded f 2 from below to ensure that at 
least a small amount (10%) of the mixture variance is assigned to the spike component, protecting 
against confounding from genome- wide effects. We bounded p from below to prevent the model 
from trying to fit too strongly to a few SNPs, which makes Bayesian analysis computationally 
difficult and also increases susceptibility to confounding. 

BOLT-LMM performs model selection among the 18 possible parameter pairs (f2,p) by per- 
forming cross-validation to optimize mean-squared prediction R 2 . More precisely, for each param- 
eter pair, for each cross-validation fold, BOLT-LMM estimates posterior mean SNP effect sizes f3 m 
on the training data (using a variational Bayes procedure, described in the next paragraph, to fit 
the model) and uses these effect size estimates to make predictions on the left-out fold, which are 
then compared to the actual left-out phenotype values. The parameter pair (f2,p) producing the 
highest prediction R 2 (mean across folds) is then selected. However, if the best model only reduces 
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prediction error by less than 1% (in absolute prediction R 2 ) compared to the infinitesimal model, 
BOLT-LMM terminates the Gaussin mixture analysis and does not proceed to step 2b to compute 
association statistics. We implement this feature as the default option (which may be overriden) 
to save unnecessary computation in situations where the Gaussian mixture model provides little or 
no improvement in association power. 

BOLT-LMM uses a variational approximation algorithm to fit Bayesian linear regressions with 
Gaussian mixture priors. Approximation methods are necessary for Bayesian inference in this 
setting because analytic formulas for exact posterior means involve intractable integrals. We apply 
a fully factored variational approximation [21,22,35] that repeatedly loops through the SNPs, 
updating the estimated effect size of each SNP with its posterior mean conditional on current 
estimates of all other SNP effects. Explicitly, we begin by initializing each SNP effect (3 m to 0 
and initializing the residual phenotype y resid = y — X/3 to y. Then, each iteration performs the 
following loop over SNPs m — 1, . . . , M: 

1 . Remove effect of SNP m from residual. y res i d _ m «— y res i d + x m (3 m 

2. Re-estimate effect of SNP m. /3 m <— posterior mean given y resid _ m 

3. Replace effect of SNP m in residual. y re sid <— 2/resid-m — x m (3 m 

This iteration has also previously been termed "iterative conditional expectation (ICE)" [20]. The 
variational Bayes framework puts this iteration on a sound theoretical footing as an optimization 
of an approximate log likelihood function; the iteration monotonically increases this function and 
is guaranteed to converge [46]. In fact, we show that the optimization can be reformulated as 
cyclic coordinate descent applied to a penalized regression problem arising from Bayesian linear 
regression using a transformed prior (Supplementary Note). The approximate log likelihood also 
serves as a convenient convergence criterion: BOLT-LMM stops the iteration when the absolute 
improvement in approximate log likelihood from one update cycle through all the SNPs drops 
below 0.01. 

One difference between the algorithm BOLT-LMM applies and previous variational approaches 
to genetic modeling is that we do not estimate model hyperparameters within the variational itera- 
tion [22,35] or based on variational approximate log likelihoods [21]. Instead, BOLT-LMM uses 
cross-validation [15], which we found to be more robust to decreases in the quality of the varia- 
tional approximation caused by linkage disequilibrium. 
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Step 2b: Compute and calibrate BOLT-LMM Gaussian mixture model statistics. After in- 
ferring parameters of the mixture prior in step 2a, BOLT-LMM uses the same variational Bayes 
algorithm to estimate posterior mean residuals 2/ res id-Loco (independently for each left-out chromo- 
some). The numerators of the BOLT-LMM Gaussian mixture model statistic given in equation (14) 
are then easily obtained as dot products with test SNPs, leaving only the constant calibration factor 
c in the denominator to be calculated. Unlike in the case of the infinitesimal model, here we do 
not have a prospective statistic to calibrate against, so we instead apply LD Score regression [24]. 
LD Score regression observes that for complex traits, properly calibrated x 2 association statistics 
approximately obey the following linear trend (on average across the genome): 

X 2 ~ 1 + (constant) • (LD Score), (22) 

where loosely speaking, the LD Score of the test SNP is its effective number of LD partners. 
Because the intercept of this regression relation is 1, it follows that if we have computed a set of x 2 
statistics up to an unknown constant factor c, then one way to estimate c is to compute the intercept 
of the regression of x 2 statistics on LD Scores. 

To achieve very precise calibration, BOLT-LMM adopts a slightly more complex procedure 
that matches the intercept of LD Score regression on x 2 bolt-lmm (after calibration) to that of the 
properly-calibrated infinitesimal statistic x 2 b OLT-LMM-inf • We do so to increase robustness of the 
calibration to violations of the modeling assumptions underlying LD Score regression, which 
may result in attenuation bias (i.e., inflation of the intercept toward the mean x 2 ) mat causes 
the straightforward LD Score correction to association statistics to be conservative, though still 
much less conservative than genomic control. Because LD Score regression on either x 2 bolt-lmm 
or X 2 BOLT-LMM-inf should be affected roughly equally by attenuation bias, matching LD intercepts 
guards against this potential problem. 

Simulations. We simulated data sets from the WTCCC2 data with the goal of creating realistic 
test scenarios including linkage disequilibrium and possible confounding from population strati- 
fication or relatedness. The WTCCC2 data set contains 15,633 samples (10,204 cases and 5,429 
controls) genotyped at 360,557 SNPs after QC; full details are given in ref. [12]. We simulated 
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genotype data with N = 3,750 to 480,000 individuals and M = 150,000 or 300,000 SNPs by sub- 
sampling the desired number of SNPs from the WTCCC2 data set and then independently building 
N individuals as mosaics of individuals from the original data set using the following procedure. 
First, specify a number of "ancestors" A for each individual to have. Then, for each simulated indi- 
vidual, select A ancestors at random from among the original individuals and create the simulated 
individual's genotype data by chopping the genome into segments of 1000 SNPs and copying each 
segment from a randomly sampled (diploid) ancestor. 

Note that this approach retains realistic LD among SNPs, and for small values of A, it also 
retains population structure and introduces relatedness among individuals that share ancestors. In 
our simulations, we used A = 2 (substantial structure and relatedness) or A = 10 (low levels of 
structure and relatedness). 

We simulated phenotypes as sums of up to four components: genetic effects from a specified 
number of causal SNPs, genetic effects from a specified number of candidate causal SNPs, pop- 
ulation stratification, and environmental effects. We simulated genetic effects by selecting SNPs 
at random from the first half of the list of typed SNPs in each chromosome and at least 2Mb and 
2cM before the middle SNP (Likewise, when assessing calibration at null SNPs, we only tested 
SNPs in the second halves of chromosomes and at least 2Mb and 2cM beyond the middle SNP, to 
avoid contamination of null SNPs from LD with causal SNPs.) We sampled effect sizes for the 
selected normalized SNPs from a standard normal distribution. We simulated population strati- 
fication by taking the first principal component of the normalized genotype matrix. We sampled 
environmental effects for all individuals from a standard normal. To create phenotypes with speci- 
fied proportions of variance explained by each of these effects, we normalized each component (an 
iV-vector across samples) and formed weighted sums using the desired weights. 

Because causal SNPs in our simulated phenotypes are selected only from among genotyped 
SNPs, we computed reference LD Scores (used to calibrate the BOLT-LMM statistic in our sim- 
ulations) by summing LD only to typed SNPs — as opposed to all SNPs, which is appropriate for 
real phenotypes [24] — in 379 European-ancestry samples from the 1000 Genomes Project [47]. 

WGHS data set. The Women's Genome Health Study (WGHS) is a prospective cohort of ini- 
tially healthy, female North American health care professionals at least 45 years old at baseline 
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representing participants in the Women's Health Study (WHS) who provided a blood sample at 
baseline and consent for blood-based analyses. The WHS was a 2x2 trial beginning in 1992-1994 
of vitamin E and low dose aspirin in prevention of cancer and cardiovascular disease with about 
10 years of follow-up. Since the end of the trial, follow-up has continued in observational mode. 
Genotyping in the WGHS sample was performed using the HumanHap300 Duo "+" chips or the 
combination of the HumanHap300 Duo and iSelect chips (Illumina, San Diego, CA) with the In- 
finium II protocol. For quality control, all samples were required to have successful genotyping 
using the BeadStudio v. 3.3 software (Illumina, San Diego, CA) for at least 98% of the SNPs. A 
subset of 23,294 individuals were identified with self-reported European ancestry that could be 
verified on the basis of multidimensional scaling analysis of identity by state using 1443 ancestry 
informative markers in PLINK v. 1.06. In the final data set of these individuals, a total of 339,596 
SNPs were retained with MAF >1%, successful genotyping in 90% of the subjects, and devia- 
tions from Hardy-Weinberg equilibrium not exceeding p = 1CT 6 in significance. In our analyses, 
we further eliminated non-autosomal SNPs, duplicate SNPs, and custom SNPs, leaving 324,488 
SNPs. 

Correspondence between association power and prediction accuracy. Intuitively, mixed model 
analysis to test a candidate marker for association gains power over linear regression (assuming no 
confounding) by conditioning on other markers modeled as random effects. This intuition is es- 
pecially clear in the BOLT-LMM statistical formulation, where we retrospectively test candidate 
markers for association with residual phenotypes. Residualizing eliminates the component of phe- 
notype that was successfully predicted by other markers, so that the subsequent association test is 
trying to detect candidate marker association signal amid phenotypic variance ("noise") that has 
been decreased by a factor of l-R 2 . Note that residualizing leaves the amount of candidate marker 
signal unchanged, assuming no proximal contamination and assuming a randomly ascertained phe- 
notype; violation of these assumptions results in loss of power [12]. 

Quantitatively, a reduction of association test noise by a factor of l-R 2 results in an increase 
in x 2 statistics by a factor of \I(\-R 2 ) at truly associated loci. (More precisely, the quantity % 2 -l 
increases by 1/(1 -R 2 ) on average, but for x 2 ^> 1 dX known loci, this distinction is minor.) This 
relationship can be seen explicitly in comparing equation (1) of ref. [31] for prediction R 2 using 



24 



Downloaded from http://biorxiv.org/on September 18, 2014 



BLUP, 

R2 = 1 + ^(1-^ 2 )' (23) 
with Supplementary Table 2 of ref. [12], which gives the following formulas for mean x 2 statistics 
using linear regression vs. mixed model association: 

„ Mh 2 
mean x lr at causal SNPs = 1 + — — — (24) 

^J-causal 

2 M K i 

mean x lmm at causal SNPs = 1 + • — , (25) 

J-Vl causa i 1 — K 

where R 2 in equation (25) (denoted r 2 h 2 g in ref. [12]) satisfies the same quadratic equation (23). 

In Fig. 3a, we plot increases in mean x 2 statistics at known loci for mixed model methods 
vs. PCA, and in Fig. 3b, we plot absolute prediction R 2 . As explained above, the increase in 
mean x 2 statistics at known loci for mixed model methods vs. linear regression — assuming no 
confounding — is approximately 1/(1 — R 2 ) -1« R 2 . When including principal components in 
linear regression to avoid confounding, we get an increase in mean x 2 statistics at known loci for 
mixed model methods vs. PCA of approximately -R 2 lmm — R 2 pca, where the latter is the prediction 
R 2 obtained by linear prediction using only the principal components. Thus, the red bars in Fig. 3a 
roughly correspond to differences between red and black bars in Fig. 3b, and analogously for 
blue bars. The correspondence is approximate because of the approximations mentioned above 
and two additional subtleties: (1) the LOCO scheme that BOLT- LMM uses to avoid proximal 
contamination renders the mixed model unable to condition out effects of markers on the same 
chromosome as the candidate marker; (2) our measurement of prediction R 2 uses 5-fold cross- 
validation, reducing the training sample size by 20%. However, these two effects have small 
magnitudes and act in opposite directions, so that ultimately the correspondence is very tight, as is 
visually apparent in Fig. 3. 

Pseudo-heritability. Following ref. [3], we refer to the heritability parameter inferred from vari- 
ance components analysis as "pseudo-heritability," denoted hj; scudo . Note that hp SCudo can be larger 
than hg (the heritability explained by genotyped SNPs [39]) due to cryptic relatedness or popula- 
tion structure [48], but for samples such as WGHS without substantial relatedness or population 



25 



Downloaded from http://biorxiv.org/on September 18, 2014 



substructure, hp SCudo is close to hg. 
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Figure 1. Computational performance of mixed model association methods. Log-log plots of 
(a) run time and (b) memory as a function of sample size (N). Slopes of the curves correspond to 
exponents of power-law scaling with N. Benchmarking was performed on simulated data sets in 
which each sample was generated as a mosaic of genotype data from 2 random "parents" from the 
WTCCC2 data set (N = 15,633, M = 360K) and phenotypes were simulated with M causal = 5000 
SNPs explaining h 2 causal = 0.2 of phenotypic variance. Reported run times are medians of five 
identical runs using one core of a 2.27 GHz Intel Xeon L5640 processor. We caution that running 
time comparisons may vary by a small constant factor as a function of computing environment. 
FaST-LMM-Select (resp. GCTA-LOCO, EMMAX) memory usage exceeded the 96GB available 
at N = 15K (resp. 30K, 60K). GEMMA encountered a runtime error (segmentation fault) at N = 
30K. Software versions: FaST-LMM-Select, v2.07; GCTA-LOCO, vl.24; EMMAX, v20120210; 
GEMMA, v0.94. Numerical data are provided in Supplementary Table 1. 
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Figure 2. BOLT-LMM increases power to detect associations in simulations. Mean \ 2 at 
causal candidate SNPs as a function of (a) number of causal SNPs, (b) proportion of variance 
explained by causal SNPs, (c) number of samples. Simulations used real genotypes from the 
WTCCC2 data set (N = 15,633, M = 360K) and simulated phenotypes with the specified number 
of causal SNPs explaining the specified proportion of phenotypic variance and 60 more candidate 
SNPs explaining an additional 2% of the variance. Error bars, s.e.m., 100 simulations. We 
verified on the first 5 simulations that the BOLT-LMM-inf and GCTA-LOCO statistics are nearly 
identical (Supplementary Table 6). Numerical data are provided in Supplementary Table 2. 
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Figure 3. BOLT-LMM increases power to detect associations for WGHS phenotypes. We 

compare power (measured using two roughly equivalent metrics) of linear regression using 10 
principal components, standard (infinitesimal) mixed model analysis, and BOLT-LMM Gaussian 
mixture model analysis, (a) Percent increases in x 2 statistics across known loci using mixed 
model methods vs. PCA: ratios of sums of x 2 statistics over typed SNPs in highest LD with 
published associated SNPs. (b) Prediction R 2 values from 5-fold cross-validation: each fold was 
left out in turn and predictions were computed by fitting all SNP effects simultaneously (for 
mixed model methods) or estimating covariate effects (for PCA) using the training folds. The 
correspondence between association power and prediction accuracy is such that the red bars in (a) 
roughly correspond to differences between red and black bars in (b), and analogously for blue 
bars (Online Methods). Error bars, jackknife s.e. Numerical data are provided in Supplementary 
Table 8 
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Table 1. Comparison of fast mixed model association methods that model all SNPs. 



Method 3 


Avoids proximal Models non-infinitesimal 
Requires 0(MN 2 ) time contamination genetic architecture 


EMMAX [3] 
FaST-LMM [9] 
FaST-LMM-Select [9, 1 1, 15] 
GEMMA [6] 

GRAMMAR-Gamma [10] 
GCTA-LOCO [12] 
BOLT-LMM 


X 

X b X 

X b X x c 

X 

X 

X X 

X X 



a For methods that have been updated over multiple publications, we cite and list characteristics of 
the latest published version. b If M < N, FaST-LMM and FaST-LMM-Select can complete in 
0(M 2 N) time. c FaST-LMM-Select models non-infinitesimal genetic architectures by restricting 
the mixed model to a subset of SNPs; a caveat of this approach is that it may incur susceptibility 
to confounding from stratification [12]. 
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Supplementary Note 

1 BOLT-LMM algorithm details 

The BOLT-LMM algorithm is overviewed in Online Methods. Here, we provide additional details 
for components of the computation not fully described earlier. 

1.1 Model setup 

1.1.1 Initialization: Missing data, normalization, and covariates 

Given a data set of genotypes and phenotypes, we apply the following procedure to create a nor- 
malized genotype matrix X. We begin by dealing with missing data as follows. First, we perform 
QC by filtering out SNPs and individuals with missing rates exceeding thresholds (default 10% 
for each). Second, we filter out individuals with missing phenotypes. Third, if the analysis in- 
cludes covariates, we filter out individuals with any missing covariates. (As an alternative to this 
approach, known as "complete case analysis," we also implement the option to use the "missing 
indicator method," which adds missing status indicator variables as additional covariates.) Finally, 
we replace missing genotypes in the remaining data with per-SNP averages. We denote by N and 
M the numbers of samples and SNPs remaining post-QC. 

Next, we mean-center each SNP and normalize the SNPs to have equal sample variance. We 
also mean-center the phenotypes. We model covariates by projecting them out from both genotypes 
and phenotypes, which is equivalent to including them as fixed effects. Explicitly, we compute a 
basis spanning the covariate vectors and subtract out the components of the SNP and phenotype 
vectors along the basis vectors. This procedure is mathematically equivalent to multiplying by 
an orthogonal projection matrix Pfi xe d that projects vectors to an (N — C) -dimensional subspace 
S C M. N orthogonal to the covariates, where C is the number of independent covariates (including 
the all- Is vector, which is implicitly included as a covariate upon mean-centering). For compar- 
ison, GCTA-LOCO [12] version 1.24 provides two options for treating covariates: (1) projecting 
out covariates from the phenotype vector only (the default, which is an approximate approach); 
or (2) fitting covariates as fixed effects along with the candidate SNP in each association test (the 
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— mlma-no-ad j-covar option, which is equivalent to projecting covariates out from both geno- 
types and phenotypes). The GCTA-LOCO documentation notes that the latter option significantly 
reduces the computational efficiency of GCTA-LOCO, but the BOLT-LMM implementation is not 
subject to this loss of computational efficiency. 

We denote by X the final N x M matrix of genotypes and y the final iV-vector of pheno- 
types after QC, normalization, and projection. While y and all columns of X (i.e., SNP vectors) 
are iV-dimensional vectors, it is important to keep in mind that they all actually belong to the 
(JV — C) -dimensional subspace S left after projecting by Pfi xe d- For example, when estimating 
variance parameters, accounting for the loss of degrees of freedom distinguishes restricted max- 
imum likelihood (REML) analysis — the preferred approach, which we implement — from maxi- 
mum likelihood (ML), and when computing x 2 test statistics, we need to use N — C instead of iV 
as the sample size. 

1.1.2 Models 

We therefore use the following general model to estimate hyperparameters (i.e., variance param- 
eters under the infinitesimal model and Gaussian mixture parameters under the non-infinitesimal 
model): 

y = X(5 + e vmh (26) 

where 

G S C M. : projected phenotypes 

= N x M matrix, x m G S C M. N for m = 1, . . . , M : projected genotypes 
G R M : iid random effects with E[(3 m ] = 0, Var(/3 m ) = cr^ 
G S C R N : iid random normal noise with e prq j ~ N(0, cTgPfi xe d)- 

Reiterating the point above, this "mixed" model does not contain any fixed effects; the fixed effects 
have been projected out, so that the entire model — including the noise term e prq j — lives in the 
subspace S. 



y 

X 

P 

; proj 
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Model for estimating infinitesimal model parameters. When estimating variance parameters 
under the infinitesimal model in Step la of the BOLT-LMM algorithm, we assume the SNP effect 
prior is the normal distribution 

and we estimate ct| and a 2 using the stochastic REML approximation algorithm described in Sec- 
tion 1.2. Note that the following notation is often used for the total genetic effect X/3 and its 
covariance: 

g = Xf3 
XX' 

K = — (the "empirical kinship" or GRM) 

a 2 K = Cov(g) 

in which case 

Model for estimating Gaussian mixture model parameters. When estimating Gaussian mix- 
ture parameters in Step 2a of the BOLT-LMM algorithm (after having obtained estimates of a 2 and 
a 2 in Step la), we generalize the SNP effect prior to the three-parameter mixture of normals 

{N(0,i7 2 gi ) with probability p 
(27) 
N(0, crg j2 ) with probability 1 — p, 

where we require 

pa% x + (1 - p)a% 2 = o% = a 2 g /M. 

We reparameterize the remaining two degrees of freedom using the parameters f 2 and p, where 
f 2 denotes the proportion of the total mixture variance within the second Gaussian (the "spike" 
component that models small genome- wide effects): 

pa% x + {l-p)(T% 2 
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Thus, Step 2a consists of estimating / 2 and p. 

Models for computing association statistics. When performing association tests in Steps lb 
and 2b of the BOLT-LMM algorithm, we modify the above models slightly, including the SNP 
being tested as a fixed effect, and leaving out all SNPs on its chromosome from the random effect: 

D = ^testAest + ^LOCO/3 + e pro j. 

As discussed in Online Methods, we do not by default re-estimate the hyperparameters a 2 , a 2 , /2, p 
when performing association tests; instead, we simply reuse the estimates obtained above from the 
(slightly different) models above, which include all SNPs in the random effect (and no fixed effect). 
However, we also offer the option of re-estimating a 2 and a 2 with each chromosome left out in 
turn. 

1.2 Variance component estimation: Stochastic REML approximation 

Here we provide details of our stochastic algorithm for estimating REML variance parameters in 
Step la of BOLT-LMM. As described in Online Methods, the crux of the method is to employ 
the observation that the REML first-order conditions on a 2 and a 2 are equivalent to the system of 
equations 

where both the left and right sides of each equation are functions of the assumed variance parame- 
ters a 2 and a 2 [44, equation (14.8)]. (The summations are over components of each vector, which 
is a slight abuse of notation for ||e|| 2 and ||/3|| 2 .) On the right sides, e data and /3 da ta are best lin- 
ear unbiased predictions (BLUP) on the observed phenotype data assuming the basic linear mixed 
model y = X/3 + e pro j, equation (26), with 

XX' 

C0v(y) = °g^j- + ^Pfixed- 

On the left sides, expectations are taken over the same quantities, with the phenotype data replaced 
by random y ran d generated according to the model y = X(3 + e pro j with variance parameters set to 
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the assumed a 2 and a 2 . 
Explicitly, defining 



6 = olja] (29) 
XX' 

H = -jf + SlN, (30) 
the BLUP estimates (3 and e for a phenotype vector y are: 

h = ^X'H^y (31) 
M 

e = 5H- X y. (32) 



(Note that technically, the scaled covariance matrix is actually Hs = a 2 Co\(y) = ^jf- + <5Pfi xe d; 
however, because all of our vectors belong to the subspace S, using I N instead of Pfi xe d produces 
the correct result and has the convenience of making the operator H invertible in the full space 
R N .) 

Equations (29)-(32) show that for a fixed value of 5, the BLUP predictions e and /3 are constant. 
Thus, the right sides of the REML first order conditions (28), involving BLUP on the observed 
phenotypes, depend only on the variance ratio 5 and are independent of the variance scale, which 
we may parameterize by a 2 (in which case a 2 = a 2 5 becomes a dependent variable). The left 
sides scale proportionately with a 2 because scaling up the variances scales up randomly generated 
phenotypes y rim d- Therefore, finding a 2 and a 2 that solve the pair of equations (28) is equivalent 
to: (1) finding the value of the single parameter 5 such that 



ElZe 2 } £e 



f^, (33) 

data 



and (2) choosing a 2 to scale the expectations to match the values observed on the data. (This 
procedure is analogous to the usual REML trick of optimizing over S and then setting a 2 = 
y'H-'y/iN-C) [45].) 

We propose an algorithm that rapidly estimates the ratio of expectations in equation (33) and 
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uses this estimate within a one-parameter search over 8. Define 

Wlog*)-log^-/ EE?])' 

For a fixed value of 5, we produce 3-15 (decreasing with N) Monte Carlo estimates of the expec- 
tations by generating random phenotypes 

|/rand X ftmnd ^rand, proj ; 

where 

A-and ~ iid N (o, ^l/M^j 

e rand ~ udN(0,VS) 

^rand, proj PfixedCrand- 

(Note that as the variance scale parameter a 2 g is irrelevant for this part of the computation, we set it 
to 1 for convenience. Note also that the use of the projection matrix Pfi xed is what makes this proce- 
dure compute REML rather than ML estimates.) We then run BLUP (using the chosen value of 5) 
using conjugate gradient iteration [26, 27] to obtain Monte Carlo estimates of E[%2 /3 2 ]/E[J2 e 2 ], 
and we likewise run BLUP to compute /3 data and e data , which together give an estimate /reml(^) °f 
/reml(<5). 

We wish to find 6 such that /reml ( log 8 ) = 0. We have observed empirically in simulations 
and real data sets that /reml (1o§ $ ) increases monotonically with 5 except possibly at extremely 
small or large values of 5 outside the range of reasonable parameter values (corresponding to, 
say, 0.01 < /ip Seudo < 0.99). Thus, one approach to finding the unique zero of /reml within the 
reasonable parameter range is binary search, with the caveat that some care is needed because we 
can only compute noisy Monte Carlo estimates /reml of /reml- 

We implement the following more robust approach. Instead of independently re-randomizing 
phenotypes |/ rand used to compute estimates /reml at different values of 8, we a generate a single 
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set of random phenotypic component pairs 

ftand ~ N (o, y/l/M) 
Grand, unsealed ~ #(0,1) 
Grand, unsealed, proj PfixedGrand, unsealed 

and then use these component pairs to generate random phenotypes 

1/rand Wand + Vd ■ 

Grand, unsealed, proj • 

for any given 5. Using this approach, the Monte Carlo estimate /REML(log<5) becomes a smooth 
function of log 5, allowing us to use the secant method, a finite difference approximation of New- 
ton's method, to perform zero-finding on /reml- 

1.3 Gaussian mixture model fitting: Variational Bayes iteration 

As outlined in Online Methods, the core component of the BOLT-LMM Gaussian mixture mod- 
eling algorithm is a variational iteration [20-22, 35] that computes approximate posterior mean 
effect sizes (3 rn for the Gaussian mixture model version of equation (26). We sketch the iteration 
in slightly more detail here, leaving a full description for Section 2. 

The idea of the iteration is to obtain successively better approximations of the SNP effect sizes 
by cyclically updating each estimated SNP effect with its posterior mean conditional on the current 
estimates of all other SNP effects. Explicitly, we begin by initializing each SNP effect (3 m to 0 and 
initializing the residual phenotype y res j d = y—X(3 to y. Then, each iteration performs the following 
loop over SNPs m — 1, . . . , M: 

1 . Remove effect of SNP m from residual. y resid _ m «— y resid + x m (5 m 

2. Re-estimate effect of SNP m. f3 m <— posterior mean given y res id 

3. Replace effect of SNP m in residual. |/ resid y res id,-m — x m (3 m 

Step 2 amounts to computing the posterior mean of (3 m given its assumed mixture prior in equa- 
tion (27), the environmental noise distribution e pro j ~ iV(0, o"gP nxe d), and the residual phenotype 
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2/resid,-m given current estimates of all other SNP effects. Steps 1 and 3 are a computational tech- 
nique allowing the residual in Step 2 not to be computed from scratch for each SNP, so that an 
entire iteration (updating each SNP effect once) takes O(MN) time. 

The basic structure of this iteration is the shared with previous variational methods [20-22,35], 
with the update Step 2 varying among methods depending on the assumed SNP effect prior. As 
noted in Online Methods, an additional slight difference is that BOLT-LMM estimates the hyperpa- 
rameters a 2 , a 2 using REML and f%, p using cross-validation [15], whereas previous approaches 
have performed hyperparameter estimation within the variational iteration [22, 35] or using the 
variational approximate log likelihood [21]. We used cross-validation instead because we found 
cross-validation estimates to be more robust to decreases in the quality of the variational approxi- 
mation caused by linkage disequilibrium. 

1.4 Cross-validation for estimating Gaussian mixture parameters 

We perform 5-fold cross-validation to select best-fit Gaussian mixture model parameters accord- 
ing to maximal prediction R 2 . For large sample sizes iV > 10,000, computing a subset of the 
cross-validation folds is already sufficient to obtain parameter estimates that achieve near-optimal 
association power, so by default, we compute only enough folds to make predictions on 10,000 test 
fold samples. This observation holds because absolute prediction R 2 corresponds directly to the 
power of the corresponding retrospective association test (as measured by increase in x 2 statistics 
at truly associated loci, or equivalently, increase in effective sample size; see Online Methods), and 
the standard error of prediction R 2 estimated via cross-validation scales as the inverse square root 
of the number of test samples. 

1.5 Performance optimization 

Beyond the broad algorithmic techniques we have described, the BOLT-LMM software employs a 
variety of performance optimizations that provide large constant-factor savings (>10x) in memory 
and running time over a straightforward implementation. 
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1.5.1 Memory 

Computation on raw genotypes. The key optimization that we implement for memory effi- 
ciency is direct computation on raw genotypes. Raw genotypes can be stored in only 2 bits per 
base, versus 64 bits (8 bytes) for standard double-precision floating point values. However, all pre- 
vious methods necessarily work with floating point matrices (either the N x N GRM or a floating 
point representation of the N x M normalized genotype matrix) because they perform spectral 
decomposition. In contrast, because BOLT-LMM applies iterative methods that use the genotype 
matrix only in matrix multiplications, it is enough for us to store the raw genotypes plus lookup 
tables containing normalization information that we apply on-the-fly. Explicitly, for each SNP, 
rather than storing a normalized genotype vector in memory, we simply store its raw allele count 
(0, 1, 2, or missing) for each individual and additionally record its mean allele count and normal- 
ization constant. Then, when performing computations involving the SNP, we build a properly 
normalized SNP vector using the above information. Importantly, this vector can be thrown away 
after the computation, thus keeping BOLT-LMM 's memory footprint small. 

An additional subtlety arises when working with covariates because normalized genotype vec- 
tors no longer contain only 4 values after projecting out covariates. In this case, we store the 
components of each normalized genotype vector along a basis that spans the covariates, and we 
do the same for the phenotype vector. We treat these "covariate component" values as additional 
coordinates that we carry along in all computations, so that whenever we need to compute a dot 
product (the basic computation all of our iterations use), we can do so by taking the usual dot 
product and subtracting the dot product of the additional covariate component vector. 

Streaming computation of association tests. In analyses of data sets containing very large num- 
bers of SNPs (e.g., millions of imputed SNPs) that may be stored as real-valued dosages, it is often 
desirable to compute association statistics for all SNPs but use only a subset of genotyped SNPs 
in the mixed model. In this situation, we retain memory efficiency by reading and analyzing SNPs 
not used in the mixed model only when performing final computation of retrospective association 
test statistics. That is, we first compute a set of residual phenotypes 2/ res id-L0C0 using only the sub- 
set of typed SNPs in the model; then, we successively read each test SNP and compute and output 
its association test, throwing away the SNP after completing the computation. This streaming 
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computation allows us never to store the full set of SNPs in memory. 
1.5.2 Computational speed 

Batch computation using optimized matrix subroutines. The iterative methods we have de- 
scribed reduce the association computation to basic building blocks of vector operations (for vari- 
ational iteration) and matrix-vector operations (for conjugate gradient iteration). These operations 
can be performed using optimized implementations of the Basic Linear Algebra Subprograms 
(BLAS), but all BLAS libraries achieve maximal speed when performing matrix-matrix "Level 3 
BLAS" operations. We therefore batch our computations into matrix-matrix multiplications by 
performing simultaneous updates across SNPs and parameter values. 

Conjugate gradient iteration uses Level 2 BLAS operations as written. Variational iteration 
only uses Level 1 BLAS as written (Section 1 .3), so to step up to Level 2 BLAS, we perform block 
updates of SNP effect sizes. Explicitly, the basic variational iteration consists of re-estimating each 
SNP effect in turn conditioned on current estimates of other SNP effects, which requires computing 
dot products of each SNP with the residual phenotype vector. Instead of computing dot products 
one at a time (a Level 1 BLAS operation), we compute dot products of a block of SNPs at once 
(a matrix- vector multiplication, which is Level 2 BLAS). When we subsequently update the effect 
size of each SNP in turn, we just need to be careful to update its dot product to reflect changes that 
have been made to the residual vector due to previous updates of previous SNPs within the block. 
We do so using an LD matrix that we precompute for each SNP block. 

To step up from Level 2 BLAS to Level 3 BLAS, we make use of the fact that all computation- 
ally intensive steps of the BOLT-LMM algorithm require multiple replicates of almost the same 
computation: Step la computes BLUP on different random phenotypes; Step lb solves a set of 
similar linear systems for different LOCO reps and calibration SNPs; Step 2a computes variational 
Bayes assuming different hyperparameter values; Step 2b computes variational Bayes for differ- 
ent LOCO reps. With some care, we can in each case simultaneously perform iterations across 
the different replicates. The upshot is that by using batch computations, BOLT-LMM performs 
all 0(MN)-time operations using BLAS 3 matrix-matrix multiplications (DGEMM). Our software 
distribution uses the well-tuned Intel Math Kernel Library (MKL) BLAS implementation. 
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Multithreading. Conveniently, most BLAS implementations also support optimized multithreaded 
computation on multi-core processors, which are now commonplace. In practice, multithreading 
rarely decreases computation time by a factor equal to the number of cores used because of vari- 
ous overhead costs, but Level 3 BLAS operations often come close to achieving this theoretically 
maximal speedup. The BOLT-LMM software supports multithreading through BLAS, and we rec- 
ommend using this option whenever multiple cores are available. We performed our benchmarking 
analyses using single-core computation simply to give a fair comparison against methods that do 
not support multithreading. 

Low-level optimization. We reduce the overhead of loading raw genotypes and building normal- 
ized genotype vectors using a few additional low-level tricks. Instead of looking up normalized 
genotype values one at a time when loading a SNP, we build an intermediate lookup table that maps 
all 256 possible values of 4 consecutive genotypes (stored in 1 byte) to a 4-vector of normalized 
values. When loading these 4-vectors from the lookup table into the normalized SNP vector being 
built, we use streaming SEVID extensions (SSE instructions) to perform multiple loads at once. 

2 Theory 

2.1 Variational Bayes 

In Bayesian analysis, one specifies a probability model over observations and model parameters, 
often wishing to obtain posterior mean estimates of parameters of interest. Unfortunately, the 
posterior distribution is typically infeasible to integrate over. The idea of the variational framework 
(termed "variational Bayes") is to approximate the posterior distribution with a factorized form that 
is much easier to work with computationally. The factorized form can be integrated over, allowing 
calculation of approximate posterior mean estimates. For a full discussion of variational Bayes, 
we refer to ref. [49]. Previous variational methods for Bayesian linear regression in genetics have 
been presented in refs. [21,22,35]. Here, we briefly summarize key aspects of variational Bayes for 
quick reference and to establish notation. We then derive formulas giving the variational iteration 
used by BOLT-LMM, and we further derive theory establishing equivalence between this family 
of variational methods and penalized linear regression. 
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2.1.1 Terminology and notation 

We begin by considering a general probability model p(y, (3) over observations y and parameters 
ft We assume that p(y, (3) takes into account the prior distributions of ft so that p(y, (3) is the joint 
probability of sampling parameters f3 and observing y. 

• The true log likelihood (LL) of observing y is the log of the integral of p(y, (3) over all 
possible values of the parameters ft 



This integral is typically intractable to compute. 

• The true posterior distribution of the parameters (3 conditional on observing y is given by 
normalizing the joint probability p(y, (3) by the likelihood: 



• The variational approximation to the posterior distribution is the best approximation of the 
posterior distribution p((3 \ y) with a distribution q((3) that factors: 



The factors are usually constrained to have simple forms. For instance, if f3 represents 
a set of individual parameters ft, each factor g» may be required to be a function of only 
one parameter ft, in which case qi is an approximate marginal posterior for ft. This fully 
factorized variational approximation is the approach we consider here. 

• The approximate log likelihood is a variational lower bound on the true log likelihood, given 



true LL = log / p(y, (3)df3. 



true posterior distribution = p((3 \ y) 



Jp(y,PW 



approx posterior distribution = q{(3) — I I q% ~ p((3 \ y). 



by: 



approx LL = C(q) 




(34) 



\ruoLL-D KL (q(f3) || p((3\y)), 
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where Dkl denotes the Kullback-Liebler (KL) divergence between probability distributions. 
Note that the second line says that the gap between the variational lower bound and the true 
LL is given by the KL divergence between the approximating distribution q and the true 
posterior distribution p(/3 | y). 

Variational iteration, which we describe in the next section, successively refines the approxi- 
mation of p((3 | y) with q{(3) by iteratively updating the factors qi, reducing the KL divergence 
Dkl(q(P) || p{/3 | y)) and therefore monotonically improving the lower bound C(q). The point 
of the iteration is that at convergence, the distribution q{(3) will ideally be a good approximation 
of the true posterior distribution p{(3 \ y) that can easily be integrated over (because of its factored 
form) to perform approximate posterior inference. 

Before moving on, we note that it follows from the above that the faithfulness of the approx- 
imating distribution q{(3) to the true posterior p((3 \ y) determines the accuracy of the resulting 
approximate posterior inference. Because we only consider approximating distributions q((3) that 
factor as Y\ i the optimal quality of the approximation depends on the extent to which the pos- 
terior distribution p((3 \ y) can be approximately factored in the prescribed manner. For Bayesian 
linear regression, which is our focus here, correlation among regressors (i.e., linkage disequilib- 
rium among markers) tends to pose a challenge for some types of inference using variational meth- 
ods. For example, when regressors are correlated, the approximate posterior is often too tightly 
concentrated (i.e., overconfident of parameter localization). However, aggregate inference can still 
be robust. We refer to ref. [21] for an in-depth exploration of these issues. 

2.1.2 General variational iteration 

The variational Bayes algorithm iteratively updates each approximate marginal distribution q^ using 
the update step 

log( 9i (ft)) <- E Qih i^i[\ogp{y, (3)} + C. 

As written, the above update step updates the entire distribution qi — a. function of — to its new 
optimum (as a distribution, optimizing in the sense of calculus of variations). Indeed, the only 
assumption made by the variational approximation is that the approximate posterior distribution 
q(f3) factors (i.e., the parameters are assumed to be conditionally independent, conditional on the 
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observed outputs): no assumption is made about the functional form of the factors qi((3i). However, 
the factor distributions q t are typically characterized by sets of sufficient statistics, so that in each 
iteration, only the sufficient statistics for factor i need to be updated based on the current values 
of the sufficient statistics for each of the other factors. Moreover, we will show in Section 2.1.3 
that in the case of Bayesian linear regression, the factor distributions take the form of conditional 
posterior distributions. For the Gaussian mixture priors we use in BOLT-LMM, these conditional 
posteriors retain the Gaussian mixture form, just with different parameters. 

Convex optimization theory guarantees the convergence of variational iteration [46] . Moreover, 
convergence of the approximate log likelihood, which monotonically increases during the iteration, 
serves as a convergence criterion. At convergence, approximate posterior means of the parameters 
(3i can be read off from the approximate marginal distributions qf. for the special case of fully 
factorized Bayesian linear regression, these values are among the sufficient statistics updated at 
each iteration. 

2.1.3 Variational iteration for Bayesian linear regression 

We now specialize to the Bayesian linear regression model 



y = X/3 + e 
e ~ N(0,a 2 e l N ) 
prior on /3 m = vr m (/3 m ) 



That is, we assume each regression coefficient (i.e., SNP effect) has a prior distribution 7r m (/3 m ) and 
the response y is the sum of regressor effects plus iid Gaussian noise with variance a\. (Note that 
this noise model does not take into account projecting out covariates as described in Section 1.1; 
projecting out C independent covariates just reduces N to N — C in the equations below.) The 
joint probability p(y, (5) of parameters (5 and observations y satisfies 

—j==\ exp ^ — J Yl 7Tm(/3m)- (35) 

Below we derive the fact that the fully factorized variational approximation matches the it- 
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erative conditional update algorithm discussed in Section 1.3. The fully factorized variational 
approximation takes the form 



M 



q(P)=Hqm(MKp(j3\v). 



m=l 



The variational update step for the approximate marginal distribution q m optimizes the KL diver- 
gence D KL (q\ \p) over all distributions q m , fixing the other marginals q m > to their current distribu- 
tions, thereby monotonically increasing the approximate (lower bound) log likelihood C(q). The 
update amounts to: 



\og(q m (Pm)) <- Eq mli p m , )t m ,^ m [\ogp(y, $)] + C 



\y-xp\\ 



2 



2a? 

e m'=l 



+ 2J log7r m /(/3„ 



+ C 



y X— m (3_ m x m f3 m \\ . . 

V \ogir m (fj m ) + C 



2al 

IZ/resid,— m •^mP: 



2(7? 



+ log7T m (/3 m ) + C, 



where (3_ m denotes the vector of estimated posterior mean effect sizes at SNPs other than m 
according to the current approximate marginal posterior distributions q m >([3 m i). (In the above se- 
quence of equations, we absorb all terms independent of f3 m into the constant C; the use of "+C" 
in all lines does not imply that the constant is the same from line to line.) To see how the posterior 
mean estimates (3_ m enter the equation in the third line above, consider expanding the quadratic 
term | \y — Xf3\ | 2 inside the expectation. (Note that taking an expectation over q m >(/3 m '), m! 7^ m 
corresponds to integrating over q m '(f3 m >)df3 m >.) 

• For terms that are linear in (3 m i (for m! 7^ m), E[/3 m >] can be replaced with /3 m , by linearity 
of expectation over q m >(f3 m '). 

• Terms that are quadratic in (3 m > are independent of (3 m , so can be replaced by (3 m , + 
Var 9m , (fi m t) to re-complete the square \ \y — X^ m /3_ m — x m /3 m \\ 2 . The leftover terms, pro- 
portional to Var 9m , (/3 m r) = E qm , — f3 m ,, are constant (with respect to f3 m ) and can be 
absorbed into the constant term. 
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Finally, the contributions of the prior terms 7r m /(/5 m /) for w! ^ m also become additive constants 
upon taking expectations and can be absorbed into the constant term as well. 
Exponentiating the result above gives 



|?/resid,— m ■Emfi-i ' 



la \ I yresid,— m •^mh'm\\ \ /n \ riei\ 

1m\Pm) oc exp ( — ] ir m {p m ), (36) 



e 



which is simply a conditional posterior distribution for (3 m (given the prior 7r m (/3 m ) and conditional 
on setting all other /3 m i to their variational expected means). That is, for Bayesian linear regression, 
the optimal approximating marginal distribution q(/3 m ) (making no assumptions about the form of 
each marginal, only requiring that the approximating posterior fully factorizes) turns out to be the 
conditional posterior. In particular, as we derive in Section 2.1.4, if the prior has the nice form of 
a mixture of Gaussians, then the approximating marginal keeps that form (with different means, 
standard deviations, and weights). 

Moreover, the approximate log likelihood C(q) can be obtained by substituting the joint prob- 
ability p(y, (5) = Pnoise(y — X(5) Y[ 7T m , given in equation (35), into equation (34): 

C{q) = J q ((3)\og^^d(3 

= E q [logp aohe (y - Xff)] + ^2 E q [log n m /q m ] 
= E q [logp aohe (y - X(3)] - ^ D KL{q m \\^m) 

log >„z _ _ _ E Dkl{ , 



N 

2 " " 2a- 2al 

(37) 



, 0 2 \\y~ x P\\ 2 Elkm|| 2 Var gm (/3 m ) >^ 

log2vra e —~ 2 —~ 2 + 2_^{E qm [\ogn m \ + H(q m )), 



e 



where (3 m = E qm [f3 m ] , using the same trick of completing the square as above, only now we have 
to be careful to account for the leftover constant terms involving Var gm (/3 m ). In the last line, H(q m ) 
denotes information entropy. Once again, we note that if C independent covariates are projected 
out from our model as described in Section 1.1, then we simply reduce iV to iV — C in equation 
(37) to account for the lost degrees of freedom. 
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2.1.4 Update equations for Gaussian mixture prior 

We now specialize to Bayesian linear regression with the specific Gaussian mixture prior 



iV(0, x ) with probability p 

Pm ~ { (38) 



N(0, ai 2 ) with probability 1 — p, 



i.e., 



n m {Pm) = V ■ i exp I ) + (1 - p) ■ . exp ( --^ 

2naj tl V 2<T f*J \/ 27r<r ?,2 V ^ 

Then the conditional marginal distribution is given by 

1 I Wm-I3 m ,l) 2 \ , , 1 / Wm-Pma) 

= eX P ^2 +1 1 -Pm) • , 



i.e., 



N (P m ,n r m,i) witn probability 

Pm 

N (Pm?, r m,2) witn probability 1 - p r< 



where 



Pr, 



T 



r. 



12 



^™ : ^+(47iwi a ) A ' 



Pr, 



«1 



P e -/3^/2 S 2 



SI S2 



q m (/3m) = Pm ■ i exp I —J — - — + (1 -p m ) ■ , exp . 

^T™, i / / o„^-2 \ ZT, 



m,2 



where s 2 and s 2 , are the variances of the marginal distribution of /3 m in each case (assuming 
2/resid -m = ^m/^m + e and marginalizing over random /3 m and e): 



S 2 ,:=a 2 ' ^ 



fc •- u /3,fc i|2 
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for A; = 1,2. These equations explicitly give the update formulas for the BOLT-LMM variational 
iteration (Step 2 in Section 1.3), in which each estimated SNP effect is set to its conditional poste- 
rior mean. 

To compute the approximate log likelihood for use in testing convergence of the iteration, it is 
convenient to consider a reparameterization of the model (following ref. [21]) in which instead of 
having one parameter (5 m for each SNP (with n m (/3 m ) a mixture of two Gaussians), we introduce 
a state parameter s m e {0, 1} and consider a prior n m ((3 m , s rn ) with: 



T^mifim | 


Sm = 0) 


= Gaussian 1: iV(0, cr| x 


m ifim | 




= Gaussian 2: N(0, a 2 2 


TTm 1 


[Sm = 0) 


= mixture fraction: p. 



This parameterization gives exactly the same model as the original Bayesian linear regression, with 
the only difference being that the mixture prior on (3 m in equation (38) is replaced by the joint prior 
distribution 7r m (/3 m , s m ) above with the extra hidden state s m . 

We can now apply the variational approach to factorize the approximate posterior distribution 
over all SNPs — in (/3 m , s m ) pairs — as a product of joint distributions q m (Pm, s m ). Each optimal 
factor distribution q m has the property that when integrated over s m , it matches the variational 
approximation q m (/3 m ) from the original parameterization, implying that the estimated posterior 
means are identical. This parameterization lends itself more easily to computation of KL diver- 
gences, however. We have the formula: 

D KL {q m {P m ) || 7T m (/3 m )) = D KL (q m ((3 m , S m ) || 

i Pin /-. \ i 1 Pm 

= p m log h(l-p m )log 



i , ^ fill* J -| 

P 1 — p 

1+lOE 



,, / r 2 T 2 i a 2 

Pm I 1 , rrc.l m,l 1 r'mA 



/ 2 2 "/Q ^ 



1 + lOE 



a 2 a 2 

° /3,2 °/3,2 
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Writing 

Var 9m (/3 m ) = E qm [(5 2 m ] -~tf m = p m ( T m ,i + P m ,i) + i 1 ~ Pm)(r mj2 + ~f m ^ - P m , 

we have all of the terms needed to compute C(q) according to equation (37). 

We note that in the limit erg 2 -» 0 of the point-normal prior used by ref. [21], the formulas 
become 

D KL (q m (f3 



Var gro (/3 m ) 

matching ref. [21]. 
2.1.5 Equivalence with penalized linear regression 

We now show that the variational iteration of Section 2. 1 .3 for Bayesian linear regression is equiv- 
alent to coordinate descent applied to a penalized linear regression problem. Equivalently, the fully 
factorized variational approximation to Bayesian linear regression (for an arbitrary choice of prior 
on regressor effects) can be recast as applying a transformation to the prior and then finding a 
posterior mode of the (new) Bayesian linear regression with transformed prior. More precisely, we 
have the following equivalence of optimization problems: 

1 . Maximize the variational approximate log likelihood (for the fully factored VB approxima- 
tion to Bayesian linear regression). 

2. Minimize the penalized linear regression objective function 

||y-X/3|| 2 + ^penalty m (/3 m ), 
where the penalty is derived from the prior in the original Bayesian linear regression. 



i _ Pm /-. \ i 1 Pm 

p m log h (1 - p m ) log 

p 1 — p 



Pr, 



1 + log 



T, 



m,l 



T 2 4- 8 



a 



13.1 



a 



0A 



Pm \t~m,l 



Pm( T m A + 0 m ,l) {PmP m ,lY 



55 



Downloaded from http://biorxiv.org/on September 18, 2014 



3. Find a posterior mode of the Bayesian linear regression with prior 

7Tm(/3m) oc exp (- penalty m (/3 m )), 

which we can view as a transformation of the original prior Tc m (/3 m ). 
We also have the following equivalence of algorithms: 

1 . Variational iteration applied to the original Bayesian linear regression. 

2. Coordinate descent applied to the corresponding penalized linear regression. 

Implications for numerical optimization. The equivalence of variational Bayes with penalized 
linear regression (in the context of Bayesian linear regression) elucidates some numerical proper- 
ties of the algorithm that are not immediately apparent. In particular, penalized linear regression is 
in general a numerically challenging non-convex optimization. We can therefore expect variational 
iteration to be susceptible to numerical issues such as convergence to local optima. Some methods 
have tried to address this problem by repeating the iteration for different random choices of update 
orders or different initialization points [21,22,35]. In our simulations, we found that using a single 
run of the iteration was typically already sufficient for BOLT-LMM to achieve most of the avail- 
able power gain, however, so we opted to use just a single run to avoid increasing computational 
cost. 

Mathematical derivation. As discussed in Section 2.1.1, variational theory gives the following 
lower bound to the log likelihood, which variational iteration tries to maximize: 

M<?J - _ y 10g27TCr e — — 2 ^KL{qm\\^m)- 

Note that C(q) is afunctional, i.e., it evaluates factored distributions q = q\ - ■ ■ qu that attempt to 
approximate the posterior. We will show that in fact, this maximization is equivalent to optimizing 
the objective function (over effect size estimates j3 m ) of a penalized linear regression: 

L0) = ~ log27T<7 e 2 - lly ~^ 112 - $>enalty m (/L). 
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The key is that we can define a mapping j3 m i->- q m ( ■ ; j3 m ) with S ?m [/3 m ] = j3 m and define 

penalty m (/3 m ) := 1 ' ; L)) + ^ g JfL (g m ( ■; 0 m )\\„ m ), 

with the property that 

(penalized LR objective) L(j3) = £ = Y\ Qm( ' ] 0m)j (approx LL with derived q m ) 

< max{£(g):g = J]g m ,E (?m [/3 m ] = /3 m } (39) 

< max£(g). (40) 

(Note that if the SNPs are normalized such that \ \x m \ \ are all equal, then the penalty is independent 
of m.) 

The mapping f3 m \-t q m ( ■ ; /3 m ) is derived from the variational update step, which we know 
chooses the q m that optimizes C(q) conditional on the current choices of the other marginal dis- 
tributions. Thus, equality holds in (39) at any convergence "point" of the iteration. Here a con- 
vergence "point" really means a choice of approximating distributions q t , . . . ,q M ; however, if the 
above mapping exists, these distributions may be parameterized by . . . , Equality holds 
in (40) at the global maximizer of the variational approximate log likelihood. It follows that 
L0), which is upper-bounded by the global maximum of C(q), attains that global maximum at 
$m,opt — E qm [f3 m ]. That is, the solution to the penalized linear regression optimization of L{p) 
corresponds to the solution to the variational optimization of C(q): if we could solve the penalized 
linear regression optimally, we would have the optimal variational Bayes solution. 

All that remains is to define the mapping j3 m (-> q m ( ■ ; j3 m ), which we obtain indirectly from 
the variational update step via the key observation that the optimal marginal q m conditioned on 
all other marginals depends only on a single statistic: the correlation of x m with the residual 

Z/resid,— m V 

J2m'^m x m'Eg m , [/3 m /]. That is, we have: 



T 

•^jrtl/resid,— m ' ^ Qm ' ' 



penalty terms Var(g m ), D KL (q 
conditional posterior mean E qm [/3 m 



Moreover, the conditional posterior mean E qm \j3 m ] is an increasing function of the correlation 



57 



Downloaded from http://biorxiv.org/on September 18, 2014 



statistic x^y res id -m, meaning that we can invert the final mapping to obtain E qm [/3 m ] \-> q m , giving 
the desired mapping j3 m i-> q m (-; j3 m ). If we wanted to derive an explicit formula for penalty m (/3 m ), 
we could try to invert the mapping a^y residi _ m ^ E qrn [/3 m ], but in general there is probably no 
closed form. 

2.2 Convergence rate of BOLT-LMM iterative computations 

Because BOLT-LMM applies iterative methods for numerical linear algebra, its running time de- 
pends not only on the cost of matrix operations, which scales linearly with M and N, but also 
the number of 0(MN)-time iterations required for convergence, which is largely determined by 
the condition number of the phenotypic covariance matrix and in particular increases with sample 
size (N), heritability, relatedness, and population structure. Simulations show empirically that the 
number of iterations scales roughly with y/~N and does not change dramatically within the range of 
typical values of the other parameters, hence our overall time complexity estimate of ^0(MN 15 ) 
Additionally, the high observed-scale heritabilities of case-control ascertained traits and high 
pseudo-heritabilities of phenotypes in family data, in combination with population structure or 
relatedness among samples, may reduce the computational efficiency of BOLT-LMM (Supple- 
mentary Fig. 2); an avenue for future investigation is to mitigate this effect by including PCs to 
improve the numerical conditioning of the computation. 

3 Parameters of mixed model software used in analyses 

We ran BOLT-LMM with default options except in analyses investigating power of the Gaussian 
mixture model, in which we used the — forceNonlnf option to fit the Gaussian mixture model 
even in scenarios with no expected power gain. 

We ran the 2012-02-10 intel64 release of EMMAX with default options. We ran GEMMA 
version 0.94 with default options. We ran GCTA version 1.24 with the — mlma-loco option for 
leave-one-chromosome-out analysis. We ran FaST-LMM version 2.07 using all markers to com- 
pute the similarity matrix. We ran FaST-LMM- Select version 2.07 with the autoSelect option 
-auto Select SearchValues 

"0,1,3,10,30,100,300,1000,3000,10000,30000,100000,300000" 
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to test increasing numbers of selected SNPs up to a maximum of all 300,000 SNPs. We bench- 
marked the precompiled executable distribution of each software package. 
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Supplementary Figure 1. Detailed computational cost metrics from running BOLT-LMM on 
simulated data sets with increasing sample size (Fig. 1): (a) running time, (b) memory usage, (c) 
conjugate gradient iterations used in LOCO analysis, (d) variational Bayes iterations used in 
LOCO analysis (max among 22 LOCO reps). Note that the conjugate gradient computation (c) is 
required to compute both the BOLT-LMM-inf and BOLT-LMM statistics, whereas the variational 
Bayes computation (d) is relevant only to the BOLT-LMM statistic. Additionally, BOLT-LMM 
skips the LOCO variational Bayes computation when estimated improvement in prediction R 2 
using the Gaussian mixture model is small (<1%); in (d), this behavior occurred for N<60K. 
(Note that in such cases, some variational Bayes work is still needed in making this 
determination; see Online Methods for details.) Each plotted point corresponds to one simulation 
with M causa i = 5000 SNPs explaining /i 2 caU sai = 0.2 of phenotypic variance. Reported run times are 
medians of five identical runs using one core of a 2.27 GHz Intel Xeon L5640 processor. 
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Supplementary Figure 2. Dependence of BOLT-LMM computational cost metrics (see 
Supplementary Fig. 1) on heritability explained by genotyped SNPs (h 2 g ), sample size, and sample 
structure. BOLT-LMM was run on simulated data sets with N = 30,000 or 60,000 samples, each 
generated as a mosaic of genotype data from 2 (left panels) or 10 (right panels) random 
"ancestors" from the WTCCC2 data set (N = 15,633, M = 360K). Phenotypes were simulated 
with M causa i = 5000 SNPs explaining /i 2 ca usai = 0.15-0.35 of phenotypic variance. In order to 
measure variational Bayes iterations (e, f) used in LOCO analysis for all parameter combinations, 
BOLT-LMM Gaussian mixture model analysis was run to completion even when estimated 
improvement in prediction R 2 using the Gaussian mixture model was small (i.e., the default 
behavior was overriden). Error bars, s.e.m., 10 simulations. 
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Supplementary Figure 3. BOLT-LMM increases power to detect associations in simulations 
while maintaining false positive control, (a) Mean x 2 at causal candidate SNPs. (b) Mean \ 2 at 
null SNPs (i.e., SNPs not in LD with causal SNPs). Simulations used real genotypes from the 
WTCCC2 data set (N = 15,633, M = 360K) and simulated phenotypes with the specified number 
of causal SNPs explaining 50% of phenotypic variance and 60 more candidate SNPs explaining 
an additional 2% of the variance. Error bars, s.e.m., 100 simulations. Plotted data are for 
BOLT-LMM, BOLT-LMM-inf, and EMMAX statistics. We verified on the first 5 simulations that 
the BOLT-LMM-inf and GCTA-LOCO statistics were nearly identical and that the EMMAX and 
GEMMA statistics were nearly identical (Supplementary Table 6). 
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Supplementary Figure 4. Power of BOLT-LMM and existing mixed model association methods 
to detect causal candidate SNPs in the simulations of Fig. 3 (real genotypes from the WTCCC2 
data set with N = 15,633 and M = 360K, simulated phenotypes with varying numbers of causal 
SNPs). Error bars, s.e.m., 100 simulations. 
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Supplementary Figure 5. Power increase of BOLT-LMM Gaussian mixture model association 
analysis over standard mixed model analysis as a function of sample size, number of causal SNPs, 
and heritability explained by genotyped SNPs (h 2 g ). Plotted are ratios of mean x 2 BOLT-LMM vs. 
BOLT-LMM-inf statistics at causal candidate SNPs as a function of (a) M causal or (b) 
h 2 g JS ' I 'M causal- This metric is equivalent to increase in effective sample size using the Gaussian 
mixture model. Data sets used N = 30,000 or 60,000 simulated individuals, each generated as a 
mosaic of genotype data from 2 (left panels) or 10 (right panels) random "ancestors" from the 
WTCCC2 data set (N = 15,633, M = 360K). Phenotypes were simulated with M causal = 
2500-15,000 SNPs explaining /i 2 causa i = 0.15-0.35 of phenotypic variance and 60 more candidate 
SNPs explaining an additional 0.005 (N = 60K) or 0.01 (N = 30K) of the variance. In all 
simulations, both BOLT-LMM and BOLT-LMM-inf statistics were properly calibrated (mean x 2 
= 1.00-1.01 at null SNPs). Error bars, s.e.m., 10 simulations. 
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Supplementary Figure 6. Q-Q plots of BOLT-LMM x 2 statistics (a, c) and BOLT-LMM-inf x 2 
statistics (b, d). The observed quantiles of both association statistics at null SNPs (a, b) match 
theoretical x 2 quantiles. The observed test statistics over all SNPs (c, d) show lift-off consistent 
with polygenicity as simulated. Data shown are from one simulation using the setup of Fig. 3 
(real genotypes from the WTCCC2 data set with N = 15,633 and M = 360K, simulated 
phenotypes with 5060 causal SNPs explaining 0.52 of phenotypic variance). To reduce clutter, 
5% of the SNPs are plotted. 
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Supplementary Figure 7. Scatter plot of BOLT-LMM-inf vs. GCTA-LOCO x 2 statistics. 
BOLT-LMM-inf and GCTA-LOCO are expected to differ slightly because GCTA-LOCO is the 
standard prospective statistic whereas BOLT-LMM-inf is a retrospective statistic. Data shown are 
from one simulation using the setup of Fig. 3 (real genotypes from the WTCCC2 data set, 
simulated phenotypes with 5060 causal SNPs explaining 0.52 of phenotypic variance). To reduce 
clutter, 5% of the SNPs are plotted. 
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Supplementary Table 2. BOLT-LMM increases power to detect associations in simulations 



(b) 



(c) 



-^^causal 


PCA BOLT-LMM-inf BOLT-LMM 


1250 
2500 
5000 
10000 


6.55 (0.07) 6.92(0.07) 8.71 (0.08) 
6.52 (0.07) 6.89 (0.08) 7.65 (0.09) 

6.56 (0.07) 6.94 (0.07) 7.23 (0.07) 
6.56 (0.07) 6.94 (0.07) 7.04 (0.07) 




U2 

' ' causal 


PCA BOLT-LMM-inf BOLT-LMM 


0.3 
0.5 
0.7 


6.41 (0.07) 6.54(0.07) 6.71 (0.07) 
6.52 (0.07) 6.89 (0.08) 7.65 (0.09) 
6.63(0.08) 7.39(0.09) 9.71 (0.11) 




N 


PCA BOLT-LMM-inf BOLT-LMM 


5211 
10422 
15633 


2.83 (0.04) 2.88 (0.04) 2.90 (0.04) 
4.71 (0.06) 4.90(0.06) 5.14(0.06) 
6.52 (0.07) 6.89 (0.08) 7.65 (0.09) 



These tables provide the numerical data (mean x 2 statistics at candidate causal SNPs) plotted in 
Fig. 2. 
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Supplementary Table 3. Calibration of BOLT-LMM statistics in simulations with varying 
genetic architectures and sample sizes 



-^^causal 


PCA BOLT-LMM-inf BOLT-LMM 


1250 
2500 
5000 
10000 


1.009(0.001) 1.006(0.001) 1.005 (0.001) 
1.008(0.001) 1.005(0.001) 1.004(0.001) 
1.007(0.001) 1.005(0.001) 1.004(0.001) 
1.009 (0.001) 1.007 (0.001) 1.007 (0.001) 



(b) 



U2 

' ' causal 


PCA BOLT-LMM-inf BOLT-LMM 


0.3 
0.5 
0.7 


1.005 (0.001) 1.004 (0.001) 1.005 (0.001) 
1.008 (0.001) 1.005 (0.001) 1.004 (0.001) 
1.011 (0.001) 1.009(0.001) 1.000(0.001) 



(c) 



N 


PCA BOLT-LMM-inf BOLT-LMM 


5211 
10422 
15633 


1.002(0.001) 1.004(0.001) 1.004(0.001) 
1.005 (0.001) 1.005(0.001) 1.005 (0.001) 
1.008(0.001) 1.005(0.001) 1.004(0.001) 



We report mean x 2 statistics at null SNPs in the simulations of Fig. 2. The very slight inflation of 
BOLT-LMM and BOLT-LMM-inf arises from the use of approximate variance parameter 
estimates and from the fact that standard mixed model methods eliminate the effects of 
stratification nearly perfectly but not completely (Supplementary Table 4). 
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Supplementary Table 4. Calibration of BOLT-LMM statistics using different variance 
component estimation procedures 



Variance component 


Leave-out 


h' 2 

<<• strat 


= 0.01 




K 1 


strat — 0 




estimation procedure 


segments 


GCTA-LOCO 


BOLT-inf 


BOLT 


GCTA-LOCO 


BOLT-inf 


BOLT 


Once with no left-out 


22 


NA 


1.005 


1.004 


NA 


1.004 


1.002 


SNPs 


100 


NA 


1.002 


1.000 


NA 


1.001 


0.999 


Independently for each 


22 


1.002 


1.001 


0.999 


1.000 


0.999 


0.997 


leave-out segment 


100 


NA 


1.000 


0.998 


NA 


0.998 


0.997 



Mean GCTA-LOCO, BOLT-LMM-inf, and BOLT-LMM chi-squared statistics at null SNPs 
computed using different methods for variance component analysis and different numbers of 
leave-out segments. We used the same simulation setup as in Fig. 2 (real genotypes from the 
WTCCC2 data set, simulated phenotypes with 2560 causal SNPs explaining 0.52 of phenotypic 
variance). We performed two sets of simulations: one with environmental stratification h 2 strat 
explaining 0.01 of the variance, and the other with no environmental stratification. Thus, the top 
left entries of 1.005 for BOLT-LMM-inf and 1.004 for BOLT-LMM correspond to the calibration 
results in Supplementary Table 3 for M causal = 2500, h 2 causal = 0.5, iV = 15,633 (appearing in each 
of Supplementary Table 3a,b,c). Data shown are from 100 simulations, which gave a standard 
error of 0.001 for all numbers above. By default, the BOLT-LMM algorithm estimates variance 
parameters once using all SNPs and then reuses the variance estimates when computing 
association statistics, leaving each of the 22 chromosomes out in turn. This procedure is an 
approximation of the theoretically precise method of re-estimating variance parameters once per 
leave-out segment, which the BOLT-LMM software offers as an alternative option (Online 
Methods). The quality of the approximation can also be improved by subdividing chromosomes 
for leave-out analysis; we compare results with 22 vs. 100 leave-out segments, which the 
BOLT-LMM software allows as well (Online Methods). GCTA-LOCO re-estimates variance 
components once for each of 22 left-out chromosomes. These simulations demonstrate that very 
slight (0.1-0.5%) inflation of x 2 test statistics can stem from two causes: (1) reusing the same 
variance parameter estimates across all LOCO reps rather than re-estimating them for each LOCO 
rep; and (2) near-complete but imperfect correction for stratification by mixed model 
methods [50]. The first source of slight inflation is specific to the BOLT-LMM approximation 
procedure but can be reduced by partitioning the genome into finer leave-out segments or 
eliminated by refitting variance parameters for each LOCO rep at a small increase in 
computational cost (Online Methods). This inflation does not increase with sample size 
(Supplementary Table 3c). The second source of slight inflation is shared by all mixed model 
methods, as evidenced by the 0.2% inflation of GCTA-LOCO in the h 2 strat = 0.01 simulation, and 
scales with sample size and proportion of variance explained by ancestry in the same manner as 
inflation caused by genuine polygenic effects (see Supplementary Table 2 of ref. [12]). To 
completely eliminate this source of very slight inflation, it is necessary to either include principal 
components as fixed effects or include a second variance component that models 
ancestry [12,50-52]. However, we believe that this very slight inflation is not a significant 
problem for standard mixed model methods or BOLT-LMM. 
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Supplementary Table 5. Type I error of BOLT-LMM and EMMAX association tests in 
simulations 



Method 


a = 10 


-2 


a = 10- 


4 


a = 10-° 


a = 5 x 10~ 8 


BOLT-LMM 


1.02 x 10~ 2 


705103 


1.04 x 10~ 4 


7221 


1.05 x 10~ e 


73 


4 x 10" s 


3 


BOLT-LMM-inf 


1.02 x 10~ 2 


707636 


1.03 x 10~ 4 


7119 


1.07 x 10~ 6 


74 


6 x 10~ 8 


4 


EMMAX 


0.63 x 10~ 2 


434081 


0.39 x 10~ 4 


2673 


0.32 x 10~ 6 


22 


0 x 10" 8 


0 



Type I error rates and counts for test statistics at null SNPs in the simulations of Fig. 3 (real 
genotypes from the WTCCC2 data set, simulated phenotypes with causal SNPs explaining 52% 
of phenotypic variance and environmental stratification explaining 1% of phenotypic variance). 
To increase statistical resolution, we combined data from all 400 simulations (with 1310, 2560, 
5060, and 10060 causal SNPs, 100 simulations each) for a total of 69,293,600 hypothesis tests. 
Type I error was statistically indistinguishable between different subsets of 100 simulations. The 
very slight upward bias of actual vs. expected Type I error rates for the BOLT-LMM-inf and 
BOLT-LMM association tests is explained by very slight (0.1-0.5%) inflation of \ 2 test statistics 
(Supplementary Table 4), which can be mitigated by performing a more precise variance 
parameter calculation if desired (Online Methods). 
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Supplementary Table 6. Correlations between mixed model statistics computed by various 
mixed model methods under the infinitesimal model 



R 2 (s.e.m.) 


GCTA-LOCO GEMMA 


BOLT-LMM-inf 
EMMAX 


0.999471 (0.000017) 0.962928 (0.000916) 
0.963326 (0.000969) 0.999997 (0.000000) 



Squared correlation coefficients between x 2 association statistics over all SNPs computed by 
various mixed model methods using the simulation setup of Fig. 3 (real genotypes from the 
WTCCC2 data set, simulated phenotypes with 5060 causal SNPs explaining 0.52 of phenotypic 
variance). Means and standard errors over 5 simulations are reported. BOLT-LMM-inf and 
GCTA-LOCO statistics both avoid proximal contamination via LOCO analysis but are expected 
to differ slightly because GCTA-LOCO is the standard prospective statistic whereas 
BOLT-LMM-inf is a retrospective statistic. EMMAX and GEMMA are both susceptible to 
proximal contamination but are expected to differ slightly because GEMMA is an exact statistic 
whereas EMMAX is an approximate statistic. 
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Supplementary Table 7. Comparison of x 2 statistics computed by various methods at 
known SNPs for WGHS phenotypes 



Phenotype 


# known 
SNPs 


# tagged 
SNPs 


#p< 0.05 
PCA & LMM 


Mean log fold 
X 2 LMM / PCA 


#p < 0.05 
PCA & BOLT 


Mean log fold 
X 2 BOLT / PCA 


ApoB 


59 (ref. [53]) 


57 


40 


2.3 (1.4) 


40 


9.9 (2.1) 


LDL 


59 (ref. [53]) 


57 


45 


1.4(1.1) 


45 


8.7 (1.7) 


HDL 


72 (ref. [53]) 


72 


55 


3.4(1.2) 


55 


9.4(1.8) 


Cholesterol 


73 (ref. [53]) 


71 


53 


1.8(1.1) 


54 


5.5(1.2) 


Triglycerides 


41 (ref. [53]) 


39 


29 


-0.6 (1.8) 


29 


1.7 (2.0) 


Height 


180 (ref. [54]) 


170 


131 


6.0(1.5) 


134 


6.2(1.6) 


BMI 


32 (ref. [55]) 


31 


18 


4.2(1.8) 


18 


3.8(1.8) 


SystolicBP 


20 (ref. [56]) 


20 


11 


2.6(1.9) 


11 


2.6(1.9) 


DiastolicBP 


19 (ref. [56]) 


19 


9 


3.2(1.8) 


9 


3.0 (2.0) 



The first two data columns provide additional information about known SNPs used in the power 
comparison of Supplementary Table 8: # known SNPs, number of genome-wide significant 
associated SNPs reported in largest GWAS to date; # tagged SNPs, number of such SNPs with an 
R 2 > 0.2 tagging SNP typed in WGHS. (Note that for the ApoB phenotype, we used known 
SNPs for the closely related LDL phenotype.) Sums of x 2 statistics compared in Supplementary 
Table 8 were computed across the WGHS-typed tagging SNPs. 

We also report here one additional metric for increase of power: Mean log fold-change in x 2 
statistics at tagging SNPs. This metric weights all tagging SNPs evenly (Supplementary Table 8), 
whereas comparing sums of x 2 statistics weights stronger associations more heavily. Because log 
fold-change is sensitive to noise from non-replicating SNPs, we restrict to tagging SNPs with at 
least nominal significance (p < 0.05) according to both methods being compared. Methods: 
PCA, linear regression using 10 principal components as covariates; LMM, BOLT-LMM-inf; 
BOLT, BOLT-LMM. Errors, s.e.m. 
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Supplementary Table 8. BOLT-LMM increases power to detect associations for WGHS 
phenotypes 









Prediction R 2 




X 2 incr. at known loci 


Phenotype 


u2 

l & pseudo 


PCA (%) 


LMM (« 


Jo) 


BOLT (%) 


p-value 


LMM vs. 


BOLT vs. 








Actual, Expected 




BOLT>PCA 


PCA (%) 


PCA (%) 


ApoB 


0.24 


0.0 


1.5(0.1), 


1.7 


10.4 (0.3) 


4.3E-6 


1.5 (0.6) 


10.0(1.2) 


LDL 


0.18 


0.1 


1.0 (0.1), 


1.0 


7.5 (0.2) 


2.8E-6 


1.4 (0.6) 


8.1 (0.8) 


HDL 


0.22 


0.1 


1.7 (0.1), 


1.5 


6.7 (0.2) 


2.3E-5 


1.7(1.5) 


7.1 (2.2) 


Cholesterol 


0.20 


0.1 


1.1(0.1), 


1.1 


5.9 (0.1) 


1.3E-6 


1.7 (0.7) 


6.9 (0.8) 


Triglycerides 


0.18 


0.0 


1.2 (0.1), 


1.0 


4.4 (0.3) 


0.00014 


0.4(1.0) 


3.1 (1.2) 


Height 


0.47 


1.9 


7.7 (0.4), 


6.4 


9.5 (0.3) 


1.9E-6 


6.4(1.3) 


7.6(1.4) 


BMI 


0.23 


0.0 


1.6 (0.1), 


1.5 


2.1 (0.1) 


8.5E-5 


1.3 (1.4) 


1.3 (1.5) 


SystolicBP 


0.17 


0.2 


1.0 (0.1), 


0.8 


1.0(0.1) 


0.00014 


3.6 (2.2) 


3.6 (2.2) 


DiastolicBP 


0.14 


0.0 


0.7 (0.0), 


0.5 


0.7 (0.0) 


0.00017 


2.9 (1.7) 


2.7(1.6) 



We report relative power of different association tests using two roughly equivalent metrics: 
comparison of x 2 statistics at known loci, a direct but noisy approach, and out-of-sample 
prediction R 2 based on the underlying model (both plotted in Figure 3. For reference, we also 
report pseudo-heritability (/i 2 pse udo), as estimated by BOLT-LMM, and expected prediction R 2 for 
LMM [31]. Methods: PCA, linear regression using 10 principal components; LMM, standard 
(infinitesimal) mixed model; BOLT, BOLT-LMM Gaussian mixture model. Actual prediction R 2 
values are from 5-fold cross-validation: predictions for each left-out fold were computed by 
fitting all SNP effects simultaneously (for mixed model methods) or estimating covariate effects 
(for PCA) using the training folds. Note in particular that for PCA, only covariate effects, not 
SNP effects, were used for prediction. Standard errors are across folds. Expected prediction R 2 
for LMM was computed using N = 23, 294 x 4/5 (taking into account the 5-fold cross-validation), 
M e ff = 60,000, and h 2 = h 2 pseudo estimated by BOLT-LMM, given that the WGHS data set 
contains little relatedness. (Note that for height, actual prediction R 2 for LMM is slightly higher 
than expected because PCs explain a non-negligible amount of variance.) Significance for 
BOLT-LMM>PCA prediction R 2 was assessed using a one-sided paired t-test across folds. 
Percent increases in x 2 statistics computed by various methods across known loci are comparisons 
between sums of x 2 statistics over typed SNPs in highest LD with published associated SNPs 
from the largest GWAS to date (Supplementary Table 7). Standard errors are jackknife estimates. 
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Supplementary Table 9. Effect size mixture parameters chosen by BOLT-LMM for WGHS 
phenotypes 



Phenotype 


h 2 

' * pseudo 


h 


V 


ApoB 


0.24 


0.1 


0.01 


LDL 


0.18 


0.1 


0.01 


HDL 


0.22 


0.1 


0.01 


Cholesterol 


0.20 


0.1 


0.01 


Triglycerides 


0.18 


0.1 


0.01 


Height 


0.47 


0.3 


0.02 


BMI 


0.23 


0.5 


0.01 


SystolicBP 


0.17 


0.5 


0.5 


DiastolicBP 


0.14 


0.5 


0.01 



We report the best-fit mixture-of-Gaussians prior on SNP effect sizes determined by BOLT-LMM 
using cross-validation to optimize out-of- sample prediction R 2 . The spike and slab mixture of 
Gaussians is parameterized by the total variance attributed to the combined Gaussian mixture 
along with two mixture parameters: f 2 , the proportion of variance allotted to the spike component 
(small-effect SNPs), and p, the probability that a SNP effect is drawn from the slab component 
(large-effect SNPs). Note that f 2 = 0.5, p = 0.5 corresponds to the infinitesimal model: when 
f 2 = l—p, the two Gaussians are identical and the mixture is degenerate. 
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Supplementary Table 10. Calibration of x 2 statistics computed by various methods for 
WGHS phenotypes 









Mean x 2 








Aqc 




Phenotype 


LR 


PCA 


LMM 


BOLT-LMM 


LR 


PCA 


LMM 


BOLT-LMM 


ApoB 


1.110 


1.098 


1.103 


1.109 


1.061 


1.049 


1.055 


1.054 


LDL 


1.100 


1.079 


1.084 


1.091 


1.055 


1.039 


1.042 


1.045 


HDL 


1.108 


1.096 


1.100 


1.100 


1.073 


1.066 


1.068 


1.065 


Cholesterol 


1.108 


1.087 


1.090 


1.094 


1.063 


1.044 


1.045 


1.046 


Triglycerides 


1.098 


1.092 


1.089 


1.090 


1.069 


1.059 


1.058 


1.057 


Height 


1.565 


1.212 


1.227 


1.232 


1.503 


1.159 


1.170 


1.169 


BMI 


1.104 


1.101 


1.099 


1.099 


1.105 


1.101 


1.100 


1.098 


SystolicBP 


1.102 


1.063 


1.076 


1.076 


1.102 


1.061 


1.074 


1.074 


DiastolicBP 


1.073 


1.056 


1.062 


1.062 


1.074 


1.058 


1.061 


1.062 



Mean x 2 test statistics across all SNPs (left columns) and genomic inflation factors A G c (right 
columns) using four methods: LR, linear regression; PCA, linear regression using 10 principal 
components as covariates; LMM, BOLT-LMM-inf; and BOLT-LMM. In all cases, both mean x 2 
and Agc exceed 1 because of polygenicity. PCA, LMM, and BOLT-LMM are consistently 
calibrated, whereas uncorrected linear regression suffers inflation due to population stratification, 
especially for height. 
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Supplementary Table 11. Control for stratification: Height /^-values at lactase computed by 
various methods 





LR PCA LMM BOLT-LMM 


X 2 statistic 
p-value 


33.09 3.66 6.56 6.07 
9 x 10~ 9 0.06 0.01 0.01 



We report x 2 statistics and p- values computed at rs201 1946, the SNP typed in WGHS with 
highest LD (R 2 = 0.64 in 1000 Genomes reference samples) to lactase-associated SNP 
rs4988235. Methods: LR, linear regression; PCA, linear regression using 10 principal 
components as covariates; LMM, BOLT-LMM-inf; and default BOLT-LMM. 
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Supplementary Table 12. Power of mixed model association vs. linear regression for 
simulated case-control ascertained traits 



(a) h 2 = 0.25 (for liabilities underlying case-control phenotype) 



Method 


Case prevalence 

50% 10% 5% 1% 0.1% 


Linear regression 

BOLT-LMM-inf 

BOLT-LMM 


2.828(0.010) 3.723 (0.015) 4.401 (0.015) 6.206(0.020) 9.179(0.026) 
2.862(0.010) 3.739(0.015) 4.384(0.015) 5.965(0.018) 8.212(0.023) 
2.887 (0.011) 3.767 (0.015) 4.397 (0.014) 5.859(0.022) 7.869 (0.025) 



(b) h 2 — 0.5 (for liabilities underlying case-control phenotype) 



Method 


Case prevalence 

50% 10% 5% 1% 0.1% 


Linear regression 

BOLT-LMM-inf 

BOLT-LMM 


4.661 (0.014) 6.452(0.021) 7.756(0.023) 11.360(0.030) 17.290(0.043) 
4.904(0.016) 6.582(0.022) 7.662(0.021) 10.202(0.025) 13.252(0.033) 
5.165 (0.018) 6.798 (0.026) 7.826(0.024) 9.816(0.034) 12.326(0.037) 



We compare mean \ 2 statistics over causal SNPs for linear regression, BOLT-LMM-inf, and 
BOLT-LMM analysis of simulated case-control traits with prevalences ranging from 50%-0.1%. 
We simulated genotypes by generating individuals as mosaics of up to 100 random "ancestors" 
from the WTCCC2 data set, resampling ancestors every 500 SNPs. We restricted the SNP set to 
the first 2,500 SNPs on each autosome, for a total of M = 55,000 SNPs (so that M effe ctive ~ 10,000 
independent SNPs). We simulated case-control phenotypes using a liability threshold model in 
which we first generated continuous phenotypes with (a) h 2 = 0.25 or (b) h 2 g = 0.5 explained by 
M causa i = 1000 markers and then defined cases as individuals with phenotypes exceeding a 
threshold corresponding to the desired prevalence. In each simulation, we ascertained 5,000 cases 
and 5,000 controls for a total of N = 10,000 simulated individuals. We then ran association 
analysis on the ascertained samples. Errors, s.e.m. over 100 simulations. 

The results show that for non-ascertained traits, power of linear regression < BOLT-LMM-inf < 
BOLT-LMM, consistent with our findings for quantitative traits. This trend reverses for 
ascertained traits at a prevalence near 5%. For the infinitesimal mixed model vs. linear regression, 
this result is consistent with the findings of ref. [12], which also observes that the relative 
reduction in test statistics increases with h 2 g and the ratio N / M effective in addition to ascertainment 
severity. In our simulations, N « M e ff ec tive> so these simulations correspond to N « 60,000 in real 
data sets, which have Meffective ~ 60,000 [12]. For data sets with fewer than 60,000 samples, 
case-control ascertainment will present less of a problem for mixed model methods than indicated 
above. 
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Supplementary Table 13. Precision of infinitesimal mixed model statistic calibration 



Phenotype 


Mean prospective stat 


Mean retrospective stat 


Calibration factor 






(pre-calibration) 


cm, equation (9) 


ApoB 


1.285 


1.281 


0.996 (0.001) 


LDL 


1.013 


l .01 1 


0.998 (0.000) 


HDL 


1.026 


1.024 


0.997 (0.001) 


Cholesterol 


0.989 


0.986 


0.997 (0.000) 


Triglycerides 


0.881 


0.879 


0.998 (0.000) 


Height 


0.783 


0.777 


0.993 (0.001) 


BMI 


0.900 


0.897 


0.997 (0.00 1) 


Systolic BP 


0.853 


0.849 


0.996 (0.001) 


Diastolic BP 


0.770 


0.769 


0.998 (0.001) 



We report the BOLT-LMM-inf calibration factor for the nine analyzed WGHS phenotypes. This 
calibration is similar to GRAMMAR- Gamma [10] but is estimated by computing statistics at only 
30 SNPs. The standard errors (jackknife estimates) show that 30 SNPs are enough to achieve high 
calibration precision. 
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